Hiroki Nakahara, H. Yonekawa, H. Iwamoto, M. Motomura
A pre-trained convolutional deep neural network (CNN) performs feed-forward computation and is widely used in embedded systems, which require high power and area efficiency. This paper realizes a binarized CNN that restricts both the inputs and the weights to the two values +1/-1. In this case, each multiplier is replaced by an XNOR circuit instead of a dedicated DSP block, so binarized inputs and weights are well suited to hardware implementation. However, a binarized CNN requires batch normalization to retain classification accuracy. The additional multiplications and additions then require extra hardware, and the memory accesses for the normalization parameters reduce system performance. In this paper, we propose a batch-normalization-free CNN that is mathematically equivalent to a CNN using batch normalization. The proposed CNN uses binarized inputs and weights together with an integer bias. We implemented the VGG-16 benchmark CNN on the NetFPGA-SUME board, which carries a Xilinx Virtex-7 FPGA and three off-chip QDR II+ synchronous SRAMs. Compared with conventional FPGA realizations, the classification error rate is 6.5% worse, but the design is 2.82 times faster, consumes 1.76 times less power, and uses 11.03 times less area. Thus, our method is suitable for embedded computer systems.
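The claimed equivalence can be seen with a short derivation. The sketch below assumes the usual sign activation and a positive learned scale; it may differ in detail from the authors' exact formulation.

```latex
% Pre-activation of one binarized neuron: y = \sum_i w_i x_i with w_i, x_i \in \{-1,+1\},
% so y is always an integer.  Batch normalization followed by the sign activation gives
% (for learned scale \gamma > 0; a negative \gamma only flips the comparison):
\[
  \operatorname{sign}\!\left(\gamma\,\frac{y-\mu}{\sqrt{\sigma^{2}+\varepsilon}}+\beta\right)
  = \operatorname{sign}\!\left(y-\Bigl(\mu-\frac{\beta\sqrt{\sigma^{2}+\varepsilon}}{\gamma}\Bigr)\right)
  = \operatorname{sign}(y+b),
  \qquad
  b = -\left\lceil \mu-\frac{\beta\sqrt{\sigma^{2}+\varepsilon}}{\gamma} \right\rceil .
\]
% Because y is an integer, the real-valued threshold can be rounded to an integer bias b
% (up to tie-breaking at zero), so the run-time scaling, shifting and parameter fetches of
% batch normalization disappear and only one integer addition per neuron remains.
```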
{"title":"A Batch Normalization Free Binarized Convolutional Deep Neural Network on an FPGA (Abstract Only)","authors":"Hiroki Nakahara, H. Yonekawa, H. Iwamoto, M. Motomura","doi":"10.1145/3020078.3021782","DOIUrl":"https://doi.org/10.1145/3020078.3021782","url":null,"abstract":"A pre-trained convolutional deep neural network (CNN) is a feed-forward computation perspective, which is widely used for the embedded systems, requires high power-and-area efficiency. This paper realizes a binarized CNN which treats only binary 2-values (+1/-1) for the inputs and the weights. In this case, the multiplier is replaced into an XNOR circuit instead of a dedicated DSP block. For hardware implementation, using binarized inputs and weights is more suitable. However, the binarized CNN requires the batch normalization techniques to retain the classification accuracy. In that case, the additional multiplication and addition require extra hardware, also, the memory access for its parameters reduces system performance. In this paper, we propose the batch normalization free CNN which is mathematically equivalent to the CNN using batch normalization. The proposed CNN treats the binarized inputs and weights with the integer bias. We implemented the VGG-16 benchmark CNN on the NetFPGA-SUME FPGA board, which has the Xilinx Inc. Virtex7 FPGA and three off-chip QDR II+ Synchronous SRAMs. Compared with the conventional FPGA realizations, although the classification error rate is 6.5% decayed, the performance is 2.82 times faster, the power efficiency is 1.76 times lower, and the area efficiency is 11.03 times smaller. Thus, our method is suitable for the embedded computer system.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124504414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Virtualization and Applications","authors":"J. Lockwood","doi":"10.1145/3257191","DOIUrl":"https://doi.org/10.1145/3257191","url":null,"abstract":"","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"388 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123264474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Special Session: The Role of FPGAs in Deep Learning","authors":"A. Ling","doi":"10.1145/3257183","DOIUrl":"https://doi.org/10.1145/3257183","url":null,"abstract":"","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"255 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115783420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Architecture","authors":"S. Wilton","doi":"10.1145/3257186","DOIUrl":"https://doi.org/10.1145/3257186","url":null,"abstract":"","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124889942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In high-performance applications, such as quantum physics and positron emission tomography, precise coincidence detection is of central importance: the quality of the reconstructed images depends on the accuracy with which the underlying system detects the coincidence of two events. This paper explores the utility of three different hardware modules for this task. In contrast to most state-of-the-art systems, these modules are edge-triggered rather than voltage-level based. This change in the modus operandi improves the accuracy of the resulting coincidence window by about one order of magnitude. In addition, this paper considers entire detector arrays, which host a large number of selected detectors. Due to the additional signal propagation delays, these arrays yield a coincidence window width as short as 70 ps within an effective range of up to 10 ns.
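As a behavioral reference for what a coincidence detector computes, the sketch below counts coincidences between two sorted timestamp streams for a fixed window. The function names, the picosecond units and the 70 ps window are illustrative assumptions; the edge-triggered FPGA modules in the abstract are implemented very differently.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Count pairs (a[i], b[j]) whose timestamps differ by at most `window`.
 * Both arrays are assumed to be sorted ascending (timestamps in picoseconds).
 * Software reference model of the coincidence criterion only. */
static size_t count_coincidences(const int64_t *a, size_t na,
                                 const int64_t *b, size_t nb,
                                 int64_t window)
{
    size_t i = 0, j = 0, hits = 0;
    while (i < na && j < nb) {
        int64_t dt = a[i] - b[j];
        if (dt > window)        j++;        /* b[j] is too old       */
        else if (dt < -window)  i++;        /* a[i] is too old       */
        else { hits++; i++; j++; }          /* |dt| <= window: hit   */
    }
    return hits;
}

int main(void)
{
    /* 70 ps coincidence window, timestamps in picoseconds (made-up data). */
    const int64_t det_a[] = { 1000, 2050, 9000 };
    const int64_t det_b[] = { 1040, 2300, 9010 };
    printf("coincidences: %zu\n",
           count_coincidences(det_a, 3, det_b, 3, 70));
    return 0;
}
```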
{"title":"Precise Coincidence Detection on FPGAs: Three Case Studies (Abstract Only)","authors":"R. Salomon, R. Joost","doi":"10.1145/3020078.3021766","DOIUrl":"https://doi.org/10.1145/3020078.3021766","url":null,"abstract":"In high-performance applications, such as quantum physics and positron emission tomography, precise coincidence detection is of central importance: The quality of the reconstructed images depends on the accuracy with which the underlying system detects the coincidence of two events. This paper explores the utility of three different hardware modules for this very task. In contrast to most of the state-of-the-art systems, these modules are edge triggered rather than being voltage-level based. This change in the modus operandi increases the accuracy of the resulting coincidence window by about one order of magnitude. In addition, this paper considers the entire detector arrays, which host a large number of selected detectors. Due to additional signal propagation delays, these arrays yield a coincidence window width as short as 70 ps within an effective range of up to 10 ns.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"55 10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122848340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large graph processing has gained great attention in recent years due to its broad applicability, from machine learning to social science. Large real-world graphs, however, are inherently difficult to process efficiently, not only because of their large memory footprint, but also because most graph algorithms entail memory access patterns with poor locality and a low compute-to-memory-access ratio. In this work, we leverage the exceptional random-access performance of the emerging Hybrid Memory Cube (HMC) technology, which stacks multiple DRAM dies on top of a logic layer, combined with the flexibility and efficiency of FPGAs, to address these challenges. To the best of our knowledge, this is the first work that implements a graph processing system on an FPGA-HMC platform based on software/hardware co-design and co-optimization. We first present algorithm modifications and a platform-aware graph processing architecture to perform level-synchronized breadth-first search (BFS) on the FPGA-HMC platform. To gain better insight into the potential bottlenecks of the proposed implementation, we develop an analytical performance model to quantitatively evaluate the HMC access latency and the corresponding BFS performance. Based on this analysis, we propose a two-level bitmap scheme to further reduce memory accesses and optimize key design parameters (e.g., memory access granularity). Finally, we evaluate the performance of our BFS implementation using the AC-510 development kit from Micron. We achieve 166 million traversed edges per second (MTEPS) on the Graph500 benchmark with a random graph of scale 25 and edge factor 16, which significantly outperforms CPU and other FPGA-based large-graph processors.
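To make the algorithmic side concrete, here is a minimal software sketch of level-synchronized BFS in which the frontier is a bitmap and a coarser summary bitmap lets the scan skip empty 64-word blocks. It illustrates the general idea of a two-level bitmap under the assumption that it resembles, but is not necessarily identical to, the scheme the authors map to the FPGA-HMC platform.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Compressed sparse row graph. */
typedef struct { uint32_t n; const uint32_t *row; const uint32_t *col; } Graph;

#define WORD(v)  ((v) >> 6)
#define BIT(v)   (1ULL << ((v) & 63))

/* Level-synchronized BFS.  `cur`/`next` are frontier bitmaps with one bit per
 * vertex; `scur`/`snext` are second-level bitmaps with one bit per 64-bit
 * frontier word, so whole empty blocks (4096 vertices) are skipped without
 * being read.  Writes BFS levels into `level` (UINT32_MAX = unreached) and
 * returns the number of levels. */
static uint32_t bfs(const Graph *g, uint32_t src, uint32_t *level)
{
    size_t words = (g->n + 63) / 64, swords = (words + 63) / 64;
    uint64_t *cur  = calloc(words, 8),  *next  = calloc(words, 8);
    uint64_t *scur = calloc(swords, 8), *snext = calloc(swords, 8);
    memset(level, 0xff, g->n * sizeof(uint32_t));

    level[src] = 0;
    cur[WORD(src)] |= BIT(src);
    scur[WORD(WORD(src))] |= BIT(WORD(src));

    uint32_t depth = 0;
    for (int active = 1; active; depth++) {
        active = 0;
        for (size_t sw = 0; sw < swords; sw++) {
            if (!scur[sw]) continue;                    /* skip empty block    */
            for (size_t w = sw * 64; w < (sw + 1) * 64 && w < words; w++) {
                uint64_t bits = cur[w];
                while (bits) {                          /* visit set vertices  */
                    uint32_t v = (uint32_t)(w * 64) + (uint32_t)__builtin_ctzll(bits);
                    bits &= bits - 1;
                    for (uint32_t e = g->row[v]; e < g->row[v + 1]; e++) {
                        uint32_t u = g->col[e];
                        if (level[u] == UINT32_MAX) {   /* first visit         */
                            level[u] = depth + 1;
                            next[WORD(u)]        |= BIT(u);
                            snext[WORD(WORD(u))] |= BIT(WORD(u));
                            active = 1;
                        }
                    }
                }
            }
        }
        uint64_t *t;
        t = cur;  cur  = next;  next  = t;  memset(next,  0, words * 8);
        t = scur; scur = snext; snext = t;  memset(snext, 0, swords * 8);
    }
    free(cur); free(next); free(scur); free(snext);
    return depth;
}
```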
{"title":"Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search","authors":"Jialiang Zhang, Soroosh Khoram, J. Li","doi":"10.1145/3020078.3021737","DOIUrl":"https://doi.org/10.1145/3020078.3021737","url":null,"abstract":"Large graph processing has gained great attention in recent years due to its broad applicability from machine learning to social science. Large real-world graphs, however, are inherently difficult to process efficiently, not only due to their large memory footprint, but also that most graph algorithms entail memory access patterns with poor locality and a low compute-to-memory access ratio. In this work, we leverage the exceptional random access performance of emerging Hybrid Memory Cube (HMC) technology that stacks multiple DRAM dies on top of a logic layer, combined with the flexibility and efficiency of FPGA to address these challenges. To our best knowledge, this is the first work that implements a graph processing system on a FPGA-HMC platform based on software/hardware co-design and co-optimization. We first present the modifications of algorithm and a platform-aware graph processing architecture to perform level-synchronized breadth first search (BFS) on FPGA-HMC platform. To gain better insights into the potential bottlenecks of proposed implementation, we develop an analytical performance model to quantitatively evaluate the HMC access latency and corresponding BFS performance. Based on the analysis, we propose a two-level bitmap scheme to further reduce memory access and perform optimization on key design parameters (e.g. memory access granularity). Finally, we evaluate the performance of our BFS implementation using the AC-510 development kit from Micron. We achieved 166 million edges traversed per second (MTEPS) using GRAPH500 benchmark on a random graph with a scale of 25 and an edge factor of 16, which significantly outperforms CPU and other FPGA-based large graph processors.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128027376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haoyang Wu, Tao Wang, Zhiwei Li, Boyan Ding, Xiaoguang Li, Tianfu Jiang, Jun Liu, Songwu Lu
Although theoretical research on cognitive radio is growing explosively, real-time platforms for cognitive radio are progressing at a slow pace. Researchers expect to prototype their designs quickly on appropriate wireless platforms so as to precisely evaluate and validate them. Platforms for cognitive radio should therefore provide both high performance and programmability. Owing to its parallel and reconfigurable nature, the FPGA is well suited to building real-time software-defined radio (SDR) platforms. However, without a carefully designed "middleware architecture layer", a real-time programmable wireless system is still difficult to build. In this paper, we present GRT 2.0, a novel high-performance and programmable SDR platform for cognitive radio. The paper focuses on the architecture of the media access control (MAC) layer and the radio frequency (RF) front-end interface. We allocate different MAC functions to different computing units, including a dedicated, lightweight embedded processor and several peripherals, to meet both programmability and microsecond-level timing requirements. A serial-to-parallel converter is adopted to solve the issues of frame-type matching and precise timing between the PHY and the RF front end. To support mobile host computers, we use the more portable USB 3.0 interface instead of PCIe. Finally, with an efficient "gain lock" state machine, automatic gain control (AGC) processing time has been reduced to less than 1 us. The evaluation shows that with the 802.11a/g protocol, GRT 2.0 achieves a maximum MAC throughput of 23 Mbps, comparable to commodity fixed-logic wireless network adapters. The latency of the RF front end is less than 2 us, more than a 10X improvement over the Ethernet cable interface. Moreover, through the carefully designed "middleware architecture layer" in the FPGA, we provide good programmability in both the MAC and the PHY.
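The "gain lock" idea can be illustrated with a small state machine: measure the received power, step the gain toward a target, and freeze the gain once the measurement stays inside a tolerance band. The states, thresholds and step size below are assumptions made for the sketch, not the actual GRT 2.0 design.

```c
#include <stdio.h>

/* Illustrative AGC "gain lock" state machine.  One call per measurement
 * interval: `power_dbm` is the measured input power before the programmable
 * gain, and the return value is the gain to program, in dB. */
typedef enum { AGC_MEASURE, AGC_ADJUST, AGC_LOCKED } agc_state_t;

typedef struct {
    agc_state_t state;
    int gain_db;            /* currently programmed gain                  */
    int stable_count;       /* consecutive in-band measurement intervals  */
} agc_t;

#define TARGET_DBM  (-30)   /* desired post-gain power                    */
#define TOL_DB        3     /* +/- tolerance band around the target       */
#define STEP_DB       2     /* gain change per adjustment                 */
#define LOCK_COUNT    4     /* in-band intervals needed to declare "lock" */

int agc_step(agc_t *a, int power_dbm)
{
    int err     = TARGET_DBM - (power_dbm + a->gain_db);  /* remaining error */
    int in_band = (err <= TOL_DB && err >= -TOL_DB);

    switch (a->state) {
    case AGC_MEASURE:
        if (!in_band) {
            a->stable_count = 0;
            a->state = AGC_ADJUST;
        } else if (++a->stable_count >= LOCK_COUNT) {
            a->state = AGC_LOCKED;                     /* gain is now frozen */
        }
        break;
    case AGC_ADJUST:
        a->gain_db += (err > 0) ? STEP_DB : -STEP_DB;  /* move toward target */
        a->state = AGC_MEASURE;
        break;
    case AGC_LOCKED:
        if (!in_band) {                                /* signal level drifted */
            a->stable_count = 0;
            a->state = AGC_ADJUST;
        }
        break;
    }
    return a->gain_db;
}

int main(void)
{
    agc_t a = { AGC_MEASURE, 0, 0 };
    const int samples[] = { -36, -36, -35, -35, -34, -34, -34, -34 };
    for (unsigned i = 0; i < sizeof samples / sizeof *samples; i++)
        printf("power %d dBm -> gain %d dB\n", samples[i], agc_step(&a, samples[i]));
    return 0;
}
```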
{"title":"GRT 2.0: An FPGA-based SDR Platform for Cognitive Radio Networks (Abstract Only)","authors":"Haoyang Wu, Tao Wang, Zhiwei Li, Boyan Ding, Xiaoguang Li, Tianfu Jiang, Jun Liu, Songwu Lu","doi":"10.1145/3020078.3021798","DOIUrl":"https://doi.org/10.1145/3020078.3021798","url":null,"abstract":"Although there is explosive growth of theoretical research on cognitive radio, the real-time platform for cognitive radio is progressing at a low pace. Researchers expect fast prototyping their designs with appropriate wireless platforms to precisely evaluate and validate their new designs. Platforms for cognitive radio should provide both high-performance and programmability. We observed that for the parallel and reconfigurable nature, FPGA is suitable for developing real-time software-defined radio (SDR) platforms. However, without a carefully designed \"middleware architecture layer\", Real-time programmable wireless system is still difficult to build. In this paper, we present GRT 2.0, a novel high-performance and programmable SDR platform for cognitive radio. This paper focuses on the architecture design of media access control (MAC) layer and radio frequency (RF) front-end interface. We allocate different MAC functions into different computing units, including a dedicated, light-weight embedded processor and several peripherals, to ensure both programmability and microsecond-level timing requirements. A serial-to-parallel converter is adopted to solve the issues of frame type matching and precise timing between PHY and RF. To support mobile host computers, we use the more portable USB 3.0 interface instead of PCIe. Finally, with the design of an efficient \"gain lock\" state machine, automatic gain control (AGC) processing time has been reduced to less than 1us. The evaluation result shows that with 802.11a/g protocol, GRT 2.0 achieves maximum throughput of 23Mbps in MAC, which is compatible to commodity fixed-logic wireless network adaptors. The latency of RF front-end is less than 2us, over 10X performance improvement to the Ethernet cable interface. Moreover, by carefully designed \"middleware architecture layer\" in FPGA, we provide good programmability both in MAC and PHY.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132497663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Emanuele Pezzotti, A. Iacobucci, G. Nash, Umer I. Cheema, Paolo Vinella, R. Ansari
Magnetic Resonance Imaging (MRI) is widely used in medical diagnostics. Sampling of MRI data on Cartesian grids allows efficient computation of the Inverse Discrete Fourier Transform for image reconstruction using the Inverse Fast Fourier Transform (IFFT) algorithm. Though the use of Cartesian trajectories simplifies the IFFT computation, non-Cartesian trajectories have been shown to provide better image resolution with lower scan times. To improve the processing time of MRI image reconstruction for these optimized non-Cartesian trajectories using a Non-uniform Fast Fourier Transform (NuFFT) algorithm, dedicated accelerators are required. We present an FPGA-based MRI solution that implements the NuFFT for image reconstruction. The solution is based on an efficient custom FPGA accelerator designed using OpenCL, and covers all the phases necessary to reconstruct an image with high accuracy, starting from raw scan data. The architecture can easily be extended to tackle 3D imaging, and k-space properties have been analyzed to reduce the number of samples processed, achieving satisfactory reconstruction accuracy while also reducing processing time. Our solution achieves a marked improvement over previously published FPGA- and CPU-based implementations and, owing to its scalability, is suitable for the image sizes common in MRI acquisitions.
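For orientation, a gridding-based NuFFT reconstruction typically (i) density-compensates the non-Cartesian samples, (ii) convolves them onto a Cartesian grid with a small kernel, (iii) applies an inverse FFT, and (iv) divides out the kernel's apodization. The 1D sketch below shows only the gridding step, with a Gaussian kernel chosen for simplicity; the kernel, oversampling factor and the OpenCL pipeline of the actual accelerator are not modeled.

```c
#include <math.h>
#include <complex.h>
#include <stddef.h>

/* Gridding step of a gridding-based NuFFT (1D, illustrative only).
 * k[i] are non-Cartesian k-space coordinates in [-0.5, 0.5), d[i] the measured
 * samples already multiplied by their density-compensation weights.  Each
 * sample is spread onto the Cartesian grid `grid` (size n) with a truncated
 * Gaussian kernel of half-width `w` cells and standard deviation `sigma`.
 * An inverse FFT of `grid` followed by deapodization (dividing the image by
 * the kernel's Fourier transform) would complete the reconstruction. */
static void grid_samples(const double *k, const double complex *d, size_t m,
                         double complex *grid, size_t n, int w, double sigma)
{
    for (size_t i = 0; i < m; i++) {
        double pos = (k[i] + 0.5) * (double)n;       /* continuous grid index */
        long   c   = lround(pos);
        for (long g = c - w; g <= c + w; g++) {
            double dist = pos - (double)g;
            double wgt  = exp(-(dist * dist) / (2.0 * sigma * sigma));
            size_t idx  = (size_t)((g % (long)n + (long)n) % (long)n);  /* wrap */
            grid[idx] += wgt * d[i];
        }
    }
}

int main(void)
{
    enum { N = 16 };
    double complex grid[N] = { 0 };
    const double k[] = { -0.13, 0.07, 0.31 };                  /* sample positions  */
    const double complex d[] = { 1.0, 0.5 + 0.5*I, -0.25*I };  /* weighted samples  */
    grid_samples(k, d, 3, grid, N, 2, 0.8);
    return 0;
}
```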
{"title":"FPGA-based Hardware Accelerator for Image Reconstruction in Magnetic Resonance Imaging (Abstract Only)","authors":"Emanuele Pezzotti, A. Iacobucci, G. Nash, Umer I. Cheema, Paolo Vinella, R. Ansari","doi":"10.1145/3020078.3021793","DOIUrl":"https://doi.org/10.1145/3020078.3021793","url":null,"abstract":"Magnetic Resonance Imaging (MRI) is widely used in medical diagnostics. Sampling of MRI data on Cartesian grids allows efficient computation of the Inverse Discrete Fourier Transform for image reconstruction using the Inverse Fast Fourier Transform (IFFT) algorithm. Though the use of Cartesian trajectories simplifies the IFFT computation, non-Cartesian trajectories have been shown to provide better image resolution with lower scan times. To improve the processing time of MRI image reconstruction for these optimized non-Cartesian trajectories using a Non-uniform Fast Fourier Transform (NuFFT) algorithm, dedicated accelerators are required. We present an FPGA-based MRI solution to implement NuFFT for image reconstruction. The solution is based on the design of an efficient custom accelerator on FPGA using OpenCL, and covers all the phases necessary to reconstruct an image with high accuracy, starting from raw scan data. The architecture can be easily extendable to tackle 3D imaging, and k-space properties have been analyzed to reduce the number of samples processed, achieving satisfactory reconstruction accuracy while positively impacting processing time. Our solution achieves a marked improvement over previously published FPGA- and CPU-based implementations and, due to its scalability, it is suitable for the image sizes common in MRI acquisitions.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132024696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nadesh Ramanathan, Shane T. Fleming, John Wickerson, G. Constantinides
Lock-free algorithms, in which threads synchronise not via coarse-grained mutual exclusion but via fine-grained atomic operations ('atomics'), have been shown empirically to be the fastest class of multi-threaded algorithms in the realm of conventional processors. This paper explores how these algorithms can be compiled from C to reconfigurable hardware via high-level synthesis (HLS). We focus on the scheduling problem, in which software instructions are assigned to hardware clock cycles. We first show that typical HLS scheduling constraints are insufficient to implement atomics, because they permit some instruction reorderings that, though sound in a single-threaded context, demonstrably cause erroneous results when synthesising multi-threaded programs. We then show that correct behaviour can be restored by imposing additional intra-thread constraints among the memory operations. We implement our approach in the open-source LegUp HLS framework, and provide both sequentially consistent (SC) and weakly consistent ('weak') atomics. Weak atomics necessitate fewer constraints than SC atomics, but suffice for many concurrent algorithms. We confirm, via automatic model-checking, that we correctly implement the semantics defined by the 2011 revision of the C standard. A case study on a circular buffer suggests that circuits synthesised from programs that use atomics can be 2.5x faster than those that use locks, and that weak atomics can yield a further 1.5x speedup.
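To give a flavor of the kind of code involved, here is a minimal single-producer/single-consumer circular buffer written with C11 atomics, in which release/acquire pairs (rather than sequentially consistent accesses) are enough to keep the data transfer race-free. It is a generic textbook sketch under that assumption, not the case-study code from the paper.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define CAP 16                      /* capacity, kept a power of two */

typedef struct {
    int data[CAP];
    atomic_size_t head;             /* advanced only by the consumer */
    atomic_size_t tail;             /* advanced only by the producer */
} ring_t;

/* Producer: write the element, then publish the new tail with a release store
 * so the consumer's acquire load of `tail` also sees the write to data[]. */
bool ring_push(ring_t *r, int v)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == CAP)
        return false;               /* full */
    r->data[tail % CAP] = v;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer: the acquire load of `tail` pairs with the producer's release, so
 * the element read below is the one the producer finished writing; the release
 * store of `head` then tells the producer the slot may be reused. */
bool ring_pop(ring_t *r, int *out)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;               /* empty */
    *out = r->data[head % CAP];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Marking every access `memory_order_seq_cst` would also be correct, but it imposes ordering constraints that an HLS scheduler must preserve, which is exactly the cost the weak atomics in the paper avoid.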
{"title":"Hardware Synthesis of Weakly Consistent C Concurrency","authors":"Nadesh Ramanathan, Shane T. Fleming, John Wickerson, G. Constantinides","doi":"10.1145/3020078.3021733","DOIUrl":"https://doi.org/10.1145/3020078.3021733","url":null,"abstract":"Lock-free algorithms, in which threads synchronise not via coarse-grained mutual exclusion but via fine-grained atomic operations ('atomics'), have been shown empirically to be the fastest class of multi-threaded algorithms in the realm of conventional processors. This paper explores how these algorithms can be compiled from C to reconfigurable hardware via high-level synthesis (HLS). We focus on the scheduling problem, in which software instructions are assigned to hardware clock cycles. We first show that typical HLS scheduling constraints are insufficient to implement atomics, because they permit some instruction reorderings that, though sound in a single-threaded context, demonstrably cause erroneous results when synthesising multi-threaded programs. We then show that correct behaviour can be restored by imposing additional intra-thread constraints among the memory operations. We implement our approach in the open-source LegUp HLS framework, and provide both sequentially consistent (SC) and weakly consistent ('weak') atomics. Weak atomics necessitate fewer constraints than SC atomics, but suffice for many concurrent algorithms. We confirm, via automatic model-checking, that we correctly implement the semantics defined by the 2011 revision of the C standard. A case study on a circular buffer suggests that circuits synthesised from programs that use atomics can be 2.5x faster than those that use locks, and that weak atomics can yield a further 1.5x speedup.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125193547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the rapid growth of data scale, data analysis applications are starting to hit performance bottlenecks and thus require the aid of hardware acceleration. At the same time, Field Programmable Gate Arrays (FPGAs), known for their high customizability and parallel nature, have gained momentum in the past decade. However, the efficiency of developing FPGA-based acceleration systems is severely constrained by traditional languages and tools, due to their limited expressiveness and extensibility, limited libraries, and the semantic gap between software and hardware design. This paper proposes a new open-source DSL-based hardware design framework called VeriScala (https://github.com/VeriScala/VeriScala) that supports highly abstracted object-oriented hardware definition, programmatic testing, and interactive on-chip debugging. By adopting a DSL embedded in Scala, we introduce modern software development concepts into hardware design, including object-oriented programming, parameterized types, type safety, and test automation. VeriScala enables designers to describe their hardware designs in Scala, generate Verilog code automatically, and interactively debug and test hardware designs in a real FPGA environment. Through evaluation on real-world applications and a usability test, we show that VeriScala provides a practical approach for rapid prototyping of hardware acceleration systems. (This work is supported by the National Key Research & Development Program of China, 2016YFB1000500.)
{"title":"Scala Based FPGA Design Flow (Abstract Only)","authors":"Yanqiang Liu, Yao Li, Weilun Xiong, Meng Lai, Cheng Chen, Zhengwei Qi, Haibing Guan","doi":"10.1145/3020078.3021762","DOIUrl":"https://doi.org/10.1145/3020078.3021762","url":null,"abstract":"With the rapid growth of data scale, data analysis applications start to meet the performance bottleneck, and thus requiring the aid of hardware acceleration. At the same time, Field Programmable Gate Arrays (FPGAs), known for their high customizability and parallel nature, have gained momentum in the past decade. However, the efficiency of development for acceleration system based on FPGAs is severely constrained by the traditional languages and tools, due to their deficiency in expressibility, extendability, limited libraries and semantic gap between software and hardware design. This paper proposes a new open-source DSL based hardware design framework called VeriScala (https://github.com/VeriScala/VeriScala) that supports highly abstracted object-oriented hardware defining, programmatical testing, and interactive on-chip debugging. By adopting DSL embedded in Scala, we introduce modern software developing concepts into hardware designing including object-oriented programming, parameterized types, type safety, test automation, etc. VeriScala enables designers to describe their hardware designs in Scala, generate Verilog code automatically and interactively debug and test hardware design in real FPGA environment. Through the evaluation on real world applications and usability test, we show that VeriScala provides a practical approach for rapid prototyping of hardware acceleration systems. (This work is supported by the National Key Research & Development Program of China 2016YFB1000500)","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114711359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}