Algorithms based on Convolutional Neural Networks (CNNs) have been successful in solving image recognition problems, showing very large accuracy improvements. In recent years, deconvolution layers have been widely used as key components in state-of-the-art CNNs for end-to-end training and in models that support tasks such as image segmentation. However, deconvolution algorithms are computationally intensive, which limits their applicability to real-time applications. In particular, there has been little research on efficient implementations of deconvolution algorithms on FPGA platforms. In this work, we propose and develop a fully customized deconvolution architecture for CNN-based segmentation algorithms. In addition, we propose memory sharing between the computation modules of the FPGA-based CNN accelerator, along with other optimization techniques. Furthermore, a hardware mapping framework is developed to automatically generate a high-throughput hardware design for any given CNN model on the target device. Finally, we implement our designs on a Xilinx Zynq-7030, where the deconvolution accelerator achieves a performance of 25.6 GOPS at a 200 MHz working frequency and a performance density of 0.064 GOPS/DSP using 32-bit quantization, which significantly outperforms previous designs on FPGAs. A real-time scene segmentation application on the Cityscapes dataset is used to evaluate our CNN accelerator on the Zynq-7030 board; the system achieves a performance of 57.2 GOPS and 0.143 GOPS/DSP using 16-bit quantization, and supports up to 2 frames per second for 512x512 image inputs with a power consumption of only 3.2 W.
{"title":"A Low-Power Deconvolutional Accelerator for Convolutional Neural Network Based Segmentation on FPGA: Abstract Only","authors":"Shuanglong Liu, Xinyu Niu, W. Luk","doi":"10.1145/3174243.3174991","DOIUrl":"https://doi.org/10.1145/3174243.3174991","url":null,"PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"2021 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128234782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
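To make the deconvolution (transposed convolution) layer discussed in the abstract above concrete, here is a minimal software sketch of the operation itself. It is not the authors' hardware architecture: the function name `transposed_conv2d`, the scatter-style loop order, the absence of padding, and the MAC counter are all assumptions chosen for illustration.

```python
def transposed_conv2d(image, kernel, stride):
    """Scatter-style 2D transposed convolution ("deconvolution"), no padding.

    Each input pixel is multiplied by the whole kernel and accumulated into
    the output at an offset of (row * stride, col * stride).
    Hypothetical helper for illustration only.
    """
    in_h, in_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (in_h - 1) * stride + k_h
    out_w = (in_w - 1) * stride + k_w
    out = [[0] * out_w for _ in range(out_h)]
    macs = 0  # multiply-accumulate count, a rough proxy for compute cost
    for r in range(in_h):
        for c in range(in_w):
            for i in range(k_h):
                for j in range(k_w):
                    out[r * stride + i][c * stride + j] += image[r][c] * kernel[i][j]
                    macs += 1
    return out, macs


# A 2x2 input upsampled to 4x4 with a 2x2 kernel of ones and stride 2.
upsampled, mac_count = transposed_conv2d([[1, 2], [3, 4]], [[1, 1], [1, 1]], stride=2)
```

The MAC count grows with input size times kernel size, which is one way to see why the abstract calls these layers computationally intensive for real-time segmentation.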
Stencil computation is one of the most important kernels for many applications such as image processing, solving partial differential equations, and cellular automata. Nevertheless, implementing a high-throughput stencil kernel is not trivial because of its high memory access load and low operational intensity. In this work we adopt data reuse and fine-grained parallelism and present an optimal microarchitecture for stencil computation. The data reuse line buffers not only fully utilize the external memory bandwidth and fully reuse the input data, but also minimize the size of the data reuse buffer for a given number of fine-grained parallelized and fully pipelined PEs. With the proposed microarchitecture, the number of PEs can be increased to saturate all available off-chip memory bandwidth. We implement this microarchitecture with a high-level synthesis (HLS) based template instead of register transfer level (RTL) specifications, which provides great programmability. To guide the system design, we propose a performance model in addition to detailed model evaluation and optimization analysis. Experimental results from on-board execution show that our design provides an average 6.5x speedup over a line buffer-only design with only 2.4x resource overhead. Compared with a loop transformation-only design, our design can implement a fully pipelined accelerator for applications that cannot be implemented with loop transformation alone because of its high memory conflict and low design flexibility. Furthermore, our FPGA implementation provides 83% of the throughput of a 14-core CPU with 4x the energy efficiency.
{"title":"An Optimal Microarchitecture for Stencil Computation with Data Reuse and Fine-Grained Parallelism: (Abstract Only)","authors":"Yuze Chi, Peipei Zhou, J. Cong","doi":"10.1145/3174243.3174964","DOIUrl":"https://doi.org/10.1145/3174243.3174964","url":null,"PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134500808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
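The data reuse idea in the stencil abstract above can be sketched in software as a streaming kernel that reads every input element exactly once and keeps only the elements the window still needs. This is a single-PE illustration under assumed names (`stream_box3x3`, a 3x3 box-sum stencil, row-major streaming), not the paper's HLS template. For a 3x3 window on a row-major stream, the oldest element still needed is 2 * width + 2 positions behind the newest, so the sketch uses a buffer of 2 * width + 3 elements.

```python
from collections import deque


def stream_box3x3(pixels, height, width):
    """3x3 box-sum stencil over a row-major pixel stream.

    Each input pixel is read from the stream exactly once; a reuse buffer of
    2 * width + 3 elements holds everything the window still needs.
    Hypothetical helper for illustration only.
    """
    reuse = deque(maxlen=2 * width + 3)
    out = [[0] * (width - 2) for _ in range(height - 2)]
    for index, value in enumerate(pixels):
        reuse.append(value)
        row, col = divmod(index, width)
        if row >= 2 and col >= 2:
            # The newest element is the bottom-right corner of the window;
            # the element dr rows up and dc columns left sits
            # dr * width + dc positions back in the buffer.
            total = 0
            for dr in range(3):
                for dc in range(3):
                    total += reuse[-1 - dr * width - dc]
            out[row - 2][col - 2] = total
    return out


# 4x5 image streamed in row-major order.
sums = stream_box3x3(list(range(20)), height=4, width=5)
```

In hardware, the nine reads per output would come from fixed tap positions of the buffer rather than indexed accesses; the indexing here only mirrors those tap offsets.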
We propose a framework to generate highly efficient accelerators for inferencing on FPGAs. Our framework consists of multiple algorithmic optimizations for computation complexity and communication volume reduction, a mapping methodology for efficient resource utilization, and a tool for automatic Verilog generation. The algorithmic optimizations improve throughput of frequency domain convolution so as to satisfy a given set of hardware constraints. While the Overlap-and-Add (OaA) technique is well known, it performs "wasted" computation at the edges. We propose a novel Concatenate-and-Pad (CaP) technique, which improves OaA significantly by reducing the "wasted" computation on the padded pixels. The proposed CaP used in conjunction with OaA enables us to choose a fixed FFT size at design time, and achieve low computation complexity for layers with various image sizes and kernel window sizes. We also develop a novel frequency domain loop tiling technique to further boost throughput by improving data reuse. Our mapping methodology optimizes the architecture for the target device by fast design space exploration. We quantitatively categorize FPGAs by capturing their DSP resources, on-chip memory size and external memory bandwidth into a device coefficient. We identify the optimal architectural parameters based on the tradeoff between computation and communication cost. Our framework includes a tool to automatically generate fully synthesizable Verilog. We demonstrate the framework by generating high throughput accelerators for state-of-the-art CNN models on the Intel HARP heterogeneous platform. Using our framework, we achieve throughput of 780.6 GOPS, 669.1 GOPS and 552.1 GOPS for AlexNet, VGG16 and FCN-16s respectively. These correspond to 6.8× (AlexNet) and 4.9× (VGG16) improvement compared with the state-of-the-art implementations.
{"title":"A Framework for Generating High Throughput CNN Implementations on FPGAs","authors":"Hanqing Zeng, Ren Chen, Chi Zhang, V. Prasanna","doi":"10.1145/3174243.3174265","DOIUrl":"https://doi.org/10.1145/3174243.3174265","url":null,"PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130199847","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
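The Overlap-and-Add technique named in the abstract above is a standard way to convolve a long input with a short kernel using a fixed FFT size. The sketch below shows plain one-dimensional OaA only; it does not implement the paper's Concatenate-and-Pad technique or its 2D frequency-domain tiling. All function names are hypothetical, and the small recursive FFT is there only to keep the example self-contained.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two (unnormalized)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def direct_convolve(signal, kernel):
    """Reference time-domain linear convolution."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, w in enumerate(kernel):
            out[i + j] += s * w
    return out


def overlap_add(signal, kernel, fft_size):
    """Linear convolution via Overlap-and-Add with a fixed FFT size."""
    k = len(kernel)
    if fft_size & (fft_size - 1) or fft_size < k:
        raise ValueError("fft_size must be a power of two and >= len(kernel)")
    block = fft_size - k + 1  # tile length so each tile's result fits the FFT
    kernel_f = fft(list(kernel) + [0.0] * (fft_size - k))
    out = [0.0] * (len(signal) + k - 1)
    for start in range(0, len(signal), block):
        tile = list(signal[start:start + block])
        tile_f = fft(tile + [0.0] * (fft_size - len(tile)))
        y = fft([a * b for a, b in zip(tile_f, kernel_f)], inverse=True)
        # Tails of neighbouring tiles overlap and are summed here.
        for i in range(len(tile) + k - 1):
            out[start + i] += y[i].real / fft_size
    return out


result = overlap_add([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0], [1.0, 0.0, -1.0], fft_size=4)
```

Every tile is zero-padded up to `fft_size`, so a short final tile still pays for a full-size transform, which is a one-dimensional analogue of the "wasted" edge computation the abstract refers to.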
The binary number representation has dominated digital logic for decades due to its compact storage requirements. However, since the number system is positional, it needs to "unpack" bits, perform computations, and repack the bits back to binary (e.g., partial products in multiplication). An alternative representation is the unary number system: we use N bits, out of which the first M are 1 and the rest are 0, to represent the value M/N. We present a novel method which first converts binary numbers to unary using thermometer encoders, then uses a "scaling network" followed by voting gates that we call "alternator logic", followed by an adder tree to convert the numbers back to the binary format. For monotonically increasing functions, the scaling network is all we need, which essentially uses only the routing resources and flip-flops on the FPGA architecture. Our method is especially well-suited to FPGAs due to the abundant availability of routing and FF resources, and to the ability of FPGAs to realize high fanout gates for highly oscillating functions. We compare our method to stochastic computing and to conventional binary implementations on a number of functions, as well as on two common image processing applications. Our method is clearly superior to the conventional binary implementation: our area×delay cost is on average only 3%, 8% and 32% of the binary method for 8-, 10-, and 12-bit resolutions respectively. Compared to stochastic computing, our cost is 6%, 5%, and 8% for those resolutions. The area cost includes conversions from and to the binary format. Our method outperforms the conventional binary method on an edge detection algorithm. However, it is not competitive with the binary method on the median filtering application due to the high cost of generating and saving unary representations of the input pixels.
{"title":"Routing Magic: Performing Computations Using Routing Networks and Voting Logic on Unary Encoded Data","authors":"S. Mohajer, Zhiheng Wang, K. Bazargan","doi":"10.1145/3174243.3174267","DOIUrl":"https://doi.org/10.1145/3174243.3174267","url":null,"PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130041205","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
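The claim in the abstract above that a monotonically increasing function needs only routing can be illustrated in a few lines. In a thermometer code, input bit t (1-based) is 1 exactly when the encoded level M is at least t. For a non-decreasing f, output bit k must be 1 exactly when f(M) >= k, which is the same as M >= t_k, where t_k is the smallest level with f(t_k) >= k. So each output bit is a copy of one input bit or a constant. The sketch below is an assumed software model of that idea, with hypothetical names; it covers neither the alternator logic for non-monotonic functions nor any FPGA-specific mapping.

```python
def thermometer_encode(m, n):
    """Return n bits whose first m are 1 and the rest 0 (represents m/n)."""
    if not 0 <= m <= n:
        raise ValueError("level out of range")
    return [1] * m + [0] * (n - m)


def thermometer_decode(bits):
    """Recover the level by counting ones (the adder-tree step)."""
    return sum(bits)


def build_routing(f, n_in, n_out):
    """For each output bit k = 1..n_out, pick the input level that drives it.

    f must be non-decreasing and map levels 0..n_in to levels 0..n_out.
    An entry of 0 means "tie to constant 1"; None means "tie to constant 0".
    """
    wiring = []
    for k in range(1, n_out + 1):
        source = None
        for t in range(n_in + 1):
            if f(t) >= k:
                source = t
                break
        wiring.append(source)
    return wiring


def apply_routing(wiring, bits):
    """Evaluate the function using wire copies and constants only."""
    out = []
    for source in wiring:
        if source is None:
            out.append(0)
        elif source == 0:
            out.append(1)
        else:
            out.append(bits[source - 1])
    return out


def square_scaled(m):
    """Example non-decreasing function on levels 0..8: floor(m^2 / 8)."""
    return (m * m) // 8


wiring = build_routing(square_scaled, n_in=8, n_out=8)
level_out = thermometer_decode(apply_routing(wiring, thermometer_encode(7, 8)))
```

Note that `apply_routing` performs no arithmetic or logic on the data bits; all the work is in deciding, once, which input wire feeds each output wire.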
Shenghsun Cho, Mrunal Patel, Han Chen, M. Ferdman, Peter Milder
The need for high-performance and low-power acceleration technologies in servers is driving the adoption of PCIe-connected FPGAs in datacenter environments. However, the co-development of the application software, driver, and hardware HDL for server FPGA platforms remains one of the fundamental challenges standing in the way of wide-scale adoption. The FPGA accelerator development process is plagued by a lack of comprehensive full-system simulation tools, unacceptably slow debug iteration times, and limited visibility into the software and hardware at the time of failure. In this work, we develop a framework that pairs a virtual machine and an HDL simulator to enable full-system co-simulation of a server system with a PCIe-connected FPGA. Our framework enables rapid development and debugging of unmodified application software, operating system, device drivers, and hardware design. Once debugged, neither the software nor the hardware requires any changes before being deployed in a production environment. In our case studies, we find that the co-simulation framework greatly improves debug iteration time while providing invaluable visibility into both the software and hardware components.
{"title":"A Full-System VM-HDL Co-Simulation Framework for Servers with PCIe-Connected FPGAs","authors":"Shenghsun Cho, Mrunal Patel, Han Chen, M. Ferdman, Peter Milder","doi":"10.1145/3174243.3174269","DOIUrl":"https://doi.org/10.1145/3174243.3174269","url":null,"PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"209 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122789235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We introduce the FGC Toolflow, the only tool providing flexible custom-FPGA generation and configuration to date. Currently, researchers building custom FPGAs must create FPGA schematics and bitstreams by hand. Both tasks are prohibitively time intensive and error prone. Additionally, the simulation time for bitcell configuration is very long (oftentimes longer than that of the functionality itself), making the verification of FPGA fabrics even more time consuming. Some existing toolflows and software packages are designed to help with this process, but they only generate bitcell configurations, leaving schematics to be developed by hand. Others have limitations in circuit-level and architectural parameters, which prevent them from adequately exploring the FPGA design space. The FGC flow is the only flow available that generates a custom full-FPGA schematic from a single parameter text file and generates the proper configuration bitstream for a target Verilog functionality. The parameter text file can accommodate hundreds of different parameters, including both circuit-level and architectural parameters, to fully encompass the FPGA design space. The FGC flow generates both a schematic and a configuration bitstream for an FPGA with 100 CLBs (900,000 transistors) in only 8 minutes. The flow also generates simulation files, allowing the user to quickly set up and perform simulations to verify the FPGA and its configuration at the chip level with SPICE-level accuracy. This flow was used to create, verify, and test a taped-out ultra-low power FPGA.
{"title":"FGC: A Tool-flow for Generating and Configuring Custom FPGAs(Abstract Only)","authors":"Oluseyi A. Ayorinde, He Qi, B. Calhoun","doi":"10.1145/3174243.3174997","DOIUrl":"https://doi.org/10.1145/3174243.3174997","url":null,"PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116656839","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we describe a systolic Field Programmable Gate Array (FPGA) implementation of the Fastfood algorithm that is optimised to run at a high frequency. The Fastfood algorithm supports online learning for large-scale kernel methods. Empirical results show that 500 MHz clock rates can be sustained for an architecture that can solve problems with input dimensions that are 10^3 times larger than previously reported. Unlike many recent deep learning publications, this design implements both training and prediction. This enables the use of kernel methods in applications requiring a rare combination of capacity, adaptation and speed.
{"title":"FPGA Fastfood - A High Speed Systolic Implementation of a Large Scale Online Kernel Method","authors":"Sean Fox, D. Boland, P. Leong","doi":"10.1145/3174243.3174271","DOIUrl":"https://doi.org/10.1145/3174243.3174271","url":null,"PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123484421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Serial equivalency can provide easier regression testing and customer support in production-grade CAD software. While existing parallel routing techniques have become sufficiently advanced to accelerate execution, support for serial equivalency has been very limited or ignored because it was considered costly. In this paper, we propose serial-equivalent parallel routing for FPGAs. We use optimal dependency-aware scheduling to facilitate serial equivalency of the parallel routing algorithm. This capability yields the same answer as the serial version of the parallel algorithm, regardless of how many processing cores are used. We also validate this property across different hardware platforms. Further experimental results show that we achieve a 14.27x speedup on an MPI-based distributed parallel computer and a 19.65x speedup on a GPU-based massively parallel machine. To our knowledge, this is the first parallel routing approach with a serial equivalency guarantee.
{"title":"Towards Serial-Equivalent Parallel Routing for FPGAs: (Abstract Only)","authors":"Minghua Shen, Wentai Zhang, Nong Xiao, Guojie Luo","doi":"10.1145/3174243.3174974","DOIUrl":"https://doi.org/10.1145/3174243.3174974","url":null,"PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121055740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The Scale Invariant Feature Transform (SIFT) algorithm is one of the classical feature extraction algorithms in computer vision. It consists of two stages: keypoint descriptor extraction and descriptor matching. SIFT descriptor matching is a computationally intensive process. In this work, we present the design and implementation of a hardware core accelerator for the descriptor-matching algorithm on a field programmable gate array (FPGA). Our proposed hardware core architecture is able to cope with the memory bandwidth and reach the roofline of the performance model to achieve maximum throughput. The matching core was implemented using the Xilinx Vivado® EDA design suite on a Zynq®-based FPGA development board. The proposed matching-core architecture is fully pipelined for 16-bit fixed-point operations and consists of five main submodules designed in Verilog, High Level Synthesis, and System Generator. The area resources were significantly reduced compared to the most recent matching core implemented in hardware. Our proposed hardware accelerator detects 98% of the matching points found by the software approach while running 15.7× faster.
{"title":"SIFT Keypoint Descriptor Matching Algorithm: A Fully Pipelined Accelerator on FPGA(Abstract Only)","authors":"Luka Daoud, M. K. Latif, N. Rafla","doi":"10.1145/3174243.3174994","DOIUrl":"https://doi.org/10.1145/3174243.3174994","url":null,"PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124154467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High-Level Synthesis (HLS) has advanced significantly in compiling high-level "soft" programs into efficient register-transfer level (RTL) "hard" specifications. However, manually rewriting C-like code is still often required to effectively optimize the access performance of synthesized memory subsystems. As such, extensive research has been performed on developing and implementing automated memory optimization techniques, among which memory banking has been a key technique for access performance improvement. Several key questions nevertheless remain to be answered: given a stencil-based computing kernel, what constitutes an optimal memory banking scheme that minimizes the number of memory banks required for conflict-free accesses? Furthermore, if such an optimal memory banking scheme exists, how can an FPGA designer automatically determine it? Finally, does every stencil-based kernel have an optimal banking scheme? In this paper, we attempt to optimally solve the memory banking problem for synthesizing stencil-based computing kernels using well-known theorems in graph theory. Our graph-based methodology not only computes the minimum memory partition factor for any given stencil, but also exploits the repeatability of coloring the entire memory access conflict graph, which significantly improves hardware efficiency.
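As a rough illustration of what a conflict-free banking scheme looks like, the sketch below swaps in a much simpler technique than the graph-theoretic method of the paper: a brute-force search for a linear bank mapping `bank(x, y) = (a*x + b*y) mod N`. Such a mapping is one periodic coloring of the memory access conflict graph. The search returns the smallest `N` for which a linear mapping exists, which is only an upper bound on the true minimum, although it is optimal whenever `N` equals the number of stencil points. The function name and the `max_banks` limit are hypothetical.

```python
from itertools import product
from typing import List, Optional, Tuple

Offset = Tuple[int, int]  # (dx, dy) offset of one stencil point


def find_linear_banking(
    stencil: List[Offset], max_banks: int = 64
) -> Optional[Tuple[int, int, int]]:
    """Return (n_banks, a, b) for the smallest n_banks such that
    bank(x, y) = (a*x + b*y) % n_banks sends every stencil point to a
    different bank, or None if no such mapping exists up to max_banks."""
    # At least one bank per stencil point is needed, so start the search there.
    for n in range(len(stencil), max_banks + 1):
        for a, b in product(range(n), repeat=2):
            banks = {(a * dx + b * dy) % n for dx, dy in stencil}
            if len(banks) == len(stencil):
                # The mapping is linear, so moving the stencil shifts every
                # bank index by the same constant; distinct banks at the
                # origin therefore stay distinct at every array position.
                return n, a, b
    return None


if __name__ == "__main__":
    five_point = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
    print(find_linear_banking(five_point))
```

For the five-point stencil the search finds a mapping with five banks, which matches the lower bound of one bank per simultaneously accessed element.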
{"title":"Graph-Theoretically Optimal Memory Banking for Stencil-Based Computing Kernels","authors":"Juan Escobedo, Mingjie Lin","doi":"10.1145/3174243.3174251","DOIUrl":"https://doi.org/10.1145/3174243.3174251","url":null,"abstract":"High-Level Synthesis (HLS) has advanced significantly in compiling high-level \"soft\" programs into efficient register-transfer level (RTL) \"hard\" specifications. However, manually rewriting C-like code is still often required to effectively optimize the access performance of synthesized memory subsystems. As such, extensive research has been performed on developing and implementing automated memory optimization techniques, among which memory banking has been a key technique for access performance improvement. Several key questions nevertheless remain to be answered: given a stencil-based computing kernel, what constitutes an optimal memory banking scheme that minimizes the number of memory banks required for conflict-free accesses? Furthermore, if such an optimal memory banking scheme exists, how can an FPGA designer automatically determine it? Finally, does every stencil-based kernel have an optimal banking scheme? In this paper, we attempt to optimally solve the memory banking problem for synthesizing stencil-based computing kernels using well-known theorems in graph theory. 
Our graph-based methodology not only computes the minimum memory partition factor for any given stencil, but also exploits the repeatability of coloring the entire memory access conflict graph, which significantly improves hardware efficiency.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"153 7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130637483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}