Ecenur Ustun, Shaojie Xiang, J. Gui, Cunxi Yu, Zhiru Zhang
A primary barrier to rapid hardware specialization with FPGAs stems from the weak guarantees of existing CAD tools on achieving design closure. Current methodologies require extensive manual effort to configure a large set of options across multiple stages of the toolflow in order to achieve high quality of results. Due to the size and complexity of the design space spanned by these options, coupled with the time-consuming evaluation of each design point, design space exploration for reconfigurable computing has become remarkably challenging. To tackle this challenge, we present LAMDA, a learning-assisted autotuning framework that accelerates FPGA design closure by utilizing design-specific features extracted from early stages of the design flow to guide the tuning process, yielding significant runtime savings. LAMDA automatically configures logic synthesis, technology mapping, placement, and routing to achieve design closure efficiently. Compared with a state-of-the-art FPGA-targeted autotuning system, LAMDA achieves faster timing closure on various realistic benchmarks using Intel Quartus Pro.
Title: LAMDA: Learning-Assisted Multi-stage Autotuning for FPGA Design Closure. DOI: 10.1109/FCCM.2019.00020. In: 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), April 2019.
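The core loop described above (cheap early-stage features steering which expensive full-flow runs to perform) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the toy "flow", the feature extractor, and the 1-nearest-neighbour surrogate are all stand-ins.

```python
# Hypothetical sketch of a learning-assisted autotuning loop in the spirit of
# LAMDA: a surrogate model scores candidate tool configurations from cheap
# early-stage features, so the expensive full flow runs only on promising ones.
# Every function and parameter below is illustrative, not the paper's code.

def early_features(config):
    # Stand-in for features extracted after synthesis/mapping (cheap to obtain).
    return [config["effort"], config["seed"] % 7]

def full_flow_slack(config):
    # Stand-in for the expensive place-and-route run returning timing slack.
    return config["effort"] * 2.0 - (config["seed"] % 7) * 0.5

def surrogate_score(feats, history):
    # 1-nearest-neighbour regression over previously evaluated design points.
    if not history:
        return 0.0
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(history, key=lambda h: dist(h[0], feats))[1]

def autotune(candidates, budget):
    history = []  # (features, measured slack) pairs
    best = None
    for _ in range(budget):
        # Rank unexplored candidates by predicted slack; evaluate the best.
        cand = max(candidates,
                   key=lambda c: surrogate_score(early_features(c), history))
        candidates.remove(cand)
        slack = full_flow_slack(cand)
        history.append((early_features(cand), slack))
        if best is None or slack > best[1]:
            best = (cand, slack)
    return best

pool = [{"effort": e, "seed": s} for e in (1, 2, 3) for s in range(5)]
print(autotune(pool, budget=6))
```

With this toy objective, the surrogate quickly steers evaluation toward high-effort configurations, reaching the best design point well before exhausting the 15-point space.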
Nitish Srivastava, Hongbo Rong, Prithayan Barua, Guanyu Feng, Huanqi Cao, Zhiru Zhang, D. Albonesi, Vivek Sarkar, Wenguang Chen, Paul Petersen, Geoff N. Lowney, A. Herr, C. Hughes, T. Mattson, P. Dubey
We present a language and compilation framework for productively generating high-performance systolic arrays for dense tensor kernels on spatial architectures, including FPGAs and CGRAs. It decouples a functional specification from a spatial mapping, allowing programmers to quickly explore various spatial optimizations for the same function. The actual implementation of these optimizations is left to a compiler. Thus, productivity and performance are achieved at the same time. We used this framework to implement several important dense tensor kernels. We implemented dense matrix multiply for an Arria-10 FPGA and a research CGRA, achieving 88% and 92% of the performance of manually written, highly optimized expert ("ninja") implementations in just 3% of their engineering time. Three other tensor kernels, MTTKRP, TTM, and TTMc, were also implemented with high performance and low design effort, and for the first time on spatial architectures.
Title: T2S-Tensor: Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations. DOI: 10.1109/FCCM.2019.00033. In: FCCM 2019.
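The decoupling idea above can be sketched in miniature: write the functional specification (here, matrix multiply) once, and let an interchangeable "schedule" decide the loop order, mimicking how one spec can be retargeted to different spatial mappings. This is only an illustration of the concept, not the T2S-Tensor language or API.

```python
# Illustrative sketch (not the T2S-Tensor API): one functional spec for
# matmul, with the loop-order "mapping" supplied separately. Every schedule
# computes the same function, just as different spatial optimizations of one
# T2S spec preserve its semantics.

def matmul_spec(A, B, C, order):
    n, m, k = len(A), len(B[0]), len(B)
    loops = {"i": range(n), "j": range(m), "k": range(k)}
    for x in loops[order[0]]:
        for y in loops[order[1]]:
            for z in loops[order[2]]:
                idx = dict(zip(order, (x, y, z)))
                i, j, kk = idx["i"], idx["j"], idx["k"]
                C[i][j] += A[i][kk] * B[kk][j]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
for schedule in ("ijk", "kij"):  # two mappings, same function
    C = [[0, 0], [0, 0]]
    matmul_spec(A, B, C, schedule)
    print(schedule, C)  # both print [[19, 22], [43, 50]]
```

In the real framework the schedule additionally controls data reuse, PE layout, and I/O; here only loop order is varied, which is enough to show the spec/mapping separation.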
This paper implements an OpenCL-based FPGA accelerator for YOLOv2 on an Arria-10 GX1150 FPGA board. The hardware architecture adopts a scalable pipeline design to support multi-resolution input images, and improves resource utilization through full 8-bit fixed-point computation and a CONV+BN+Leaky-ReLU layer-fusion technique. The proposed design achieves a peak throughput of 566 GOPs at a working frequency of 190 MHz. The accelerator runs YOLOv2 inference with 288×288 input resolution and tiny YOLOv2 with 416×416 input resolution at 35 and 71 FPS, respectively.
Title: A Scalable OpenCL-Based FPGA Accelerator for YOLOv2. Authors: Ke Xu, Xiaoyun Wang, Dong Wang. DOI: 10.1109/FCCM.2019.00058. In: FCCM 2019.
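The CONV+BN+Leaky-ReLU fusion mentioned above rests on a standard algebraic identity: batch-norm parameters can be folded into the convolution's weights and bias offline, so inference executes a single fused layer. A toy 1-D, single-channel sketch (made-up values, not the paper's kernel) shows the fused and unfused paths agree:

```python
import math

# Fold batch normalization into a convolution:
#   BN(conv(x, w, b)) == conv(x, w * s, (b - mean) * s + beta),
# where s = gamma / sqrt(var + eps). Leaky ReLU is applied after either path.

def conv1d(x, w, b):
    k = len(w)
    return [sum(w[t] * x[i + t] for t in range(k)) + b
            for i in range(len(x) - k + 1)]

def bn(y, gamma, beta, mean, var, eps=1e-5):
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in y]

def leaky_relu(y, alpha=0.1):
    return [v if v > 0 else alpha * v for v in y]

def fuse(w, b, gamma, beta, mean, var, eps=1e-5):
    s = gamma / math.sqrt(var + eps)          # per-channel scale
    return [wi * s for wi in w], (b - mean) * s + beta

x = [1.0, -2.0, 3.0, 0.5]
w, b = [0.2, -0.5, 0.1], 0.3
gamma, beta, mean, var = 1.5, -0.2, 0.4, 2.0

ref = leaky_relu(bn(conv1d(x, w, b), gamma, beta, mean, var))
wf, bf = fuse(w, b, gamma, beta, mean, var)
fused = leaky_relu(conv1d(x, wf, bf))
print(all(abs(a - c) < 1e-9 for a, c in zip(ref, fused)))  # prints True
```

On hardware this removes the BN stage entirely, which is one reason the fused pipeline improves resource utilization.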
Glenn G. Ko, Yuji Chai, Rob A. Rutenbar, D. Brooks, Gu-Yeon Wei
Many attribute a key component of deep learning's success to its compatibility with existing accelerators, mainly GPUs. While GPUs are great at handling the linear algebra kernels commonly found in deep learning, they are not the optimal architecture for unsupervised learning methods such as Bayesian models and inference. As a step towards a better understanding of architectures for probabilistic models, we study Gibbs sampling, one of the most commonly used algorithms for Bayesian inference, with a focus on parallelism that preserves convergence to the target distribution and on parameterized components. We propose FlexGibbs, a reconfigurable parallel Gibbs sampling inference accelerator for structured graphs. We designed an architecture optimized for solving Markov Random Field tasks using an array of parallel Gibbs samplers, enabled by chromatic scheduling. We show that for a sound source separation application, FlexGibbs configured on the FPGA fabric of a Xilinx Zynq CPU-FPGA SoC achieved a Gibbs sampling inference speedup of 1048x and a 99.85% reduction in energy over running it on an ARM Cortex-A53.
Title: FlexGibbs: Reconfigurable Parallel Gibbs Sampling Accelerator for Structured Graphs. DOI: 10.1109/FCCM.2019.00075. In: FCCM 2019.
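The chromatic scheduling mentioned above exploits a graph colouring: nodes of the same colour share no edges, so their conditional distributions are mutually independent and they can all be resampled in parallel. A toy software sketch on a 2-colourable grid MRF (an Ising model with a checkerboard colouring; the actual FlexGibbs architecture and model parameters are not reproduced here):

```python
import math
import random

# Chromatic Gibbs sampling on a 2-colourable grid MRF: within one colour
# class, no two nodes are adjacent, so the whole class can be updated in
# parallel (here sequentially in software, but with no intra-colour
# dependencies). Toy Ising model for illustration only.

def neighbours(i, j, n):
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= i + di < n and 0 <= j + dj < n:
            yield i + di, j + dj

def chromatic_gibbs_sweep(state, coupling, rng):
    n = len(state)
    for colour in (0, 1):                  # checkerboard colouring
        for i in range(n):
            for j in range(n):
                if (i + j) % 2 != colour:
                    continue
                # Conditional distribution of an Ising spin given neighbours.
                field = coupling * sum(state[a][b]
                                       for a, b in neighbours(i, j, n))
                p_up = 1.0 / (1.0 + math.exp(-2.0 * field))
                state[i][j] = 1 if rng.random() < p_up else -1
    return state

rng = random.Random(42)
n = 8
state = [[rng.choice([-1, 1]) for _ in range(n)] for _ in range(n)]
for _ in range(50):
    chromatic_gibbs_sweep(state, coupling=0.8, rng=rng)
print(sum(sum(row) for row in state))  # strong coupling drives spins to align
```

In hardware, the independence within a colour class is what lets an array of Gibbs samplers run concurrently without violating the sampler's correctness.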
This paper introduces an efficient automatic floorplanning algorithm that takes into account the heterogeneous architectures of modern FPGA families as well as partial reconfiguration (PR) constraints, and introduces the aspect ratio (AR) constraint to optimize routing. The algorithm generates possible placements of the partial modules, and then applies a recursive pseudo-bipartitioning heuristic search to find the best floorplan. The experiments show that its performance is significantly better than that of other algorithms in this field.
Title: Efficient FPGA Floorplanning for Partial Reconfiguration-Based Applications. Authors: N. Deak, O. Creţ, H. Hedesiu. DOI: 10.1109/FCCM.2019.00050. In: FCCM 2019.
Extensible Markup Language (XML) is widely used in web services. However, XML parsing is often a bottleneck that consumes considerable time and resources. In this work, we present a hybrid XML parser based on software-hardware co-design, placing hardware acceleration within a software-driven context. Our parser is based on the Document Object Model (DOM) and is capable of well-formedness checking and tree construction at a throughput of one cycle per byte (CPB). We implement the design on a Xilinx Kintex-7 FPGA, achieving a parsing throughput of 0.8 Gbps.
Title: Hybrid XML Parser Based on Software and Hardware Co-design. Authors: Zhe Pan, Xiaohong Jiang, Jian Wu, Xiang Li. DOI: 10.1109/FCCM.2019.00066. In: FCCM 2019.
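The well-formedness check such a parser performs while streaming can be sketched with a tag stack: every open tag is pushed, every close tag must match the top of the stack. This toy version (regex-based, no attributes validation, no DOM construction) only illustrates the check itself, not the paper's hardware pipeline:

```python
import re

# Toy well-formedness checker: stream over tags, push opens, pop and match
# closes, skip self-closing tags. A real DOM parser would additionally build
# the tree and validate attributes, entities, and character data.

def well_formed(xml):
    stack = []
    for closing, name, self_closing in re.findall(
            r"<(/?)([A-Za-z][\w.-]*)[^>]*?(/?)>", xml):
        if self_closing:                   # <name/>
            continue
        if closing:                        # </name>
            if not stack or stack.pop() != name:
                return False
        else:                              # <name>
            stack.append(name)
    return not stack                       # every open tag must be closed

print(well_formed("<a><b>hi</b></a>"), well_formed("<a><b></a></b>"))
# prints: True False
```

The hardware version performs the equivalent of one such stack transition per input byte, which is how the one-cycle-per-byte throughput is sustained.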
As FPGAs grow in size and speed, so too does their power consumption, which on recent FPGAs has increased to the point that it is comparable to that of high-end CPUs. To mitigate this problem, power reduction techniques such as dynamic voltage scaling (DVS) and clock gating can potentially be applied to FPGAs. However, it is unclear whether they are safe in the presence of fast voltage transients. These fast voltage transients are caused by large changes in activity, which we believe are common in most designs. Previous work has shown that it is these fast voltage transients that produce the largest variations in delay. In our work, we measure the impact transients have on applications and present a mitigation strategy to prevent them from causing timing failures. We create transient generators that can significantly reduce an application's measured Fmax, by up to 25. We also show that transients are very fast and produce an immediate timing impact, so transient mitigation must occur within the same clock cycle as the transient. We create a clock edge suppressor that detects when a transient event is happening and delays the clock edge, thus preventing timing failures. Using our clock edge suppressor, we show that we can run an application at full frequency in the presence of fast voltage transients, thereby enabling more aggressive DVS approaches and larger power savings.
Title: Fast Voltage Transients on FPGAs: Impact and Mitigation Strategies. Authors: Linda L. Shen, Ibrahim Ahmed, Vaughn Betz. DOI: 10.1109/FCCM.2019.00044. In: FCCM 2019.
M. Imani, Sahand Salamat, Behnam Khaleghi, Mohammad Samragh, F. Koushanfar, T. Simunic
Hyperdimensional (HD) computing is gaining traction as a lightweight alternative machine learning approach for cognitive tasks. Inspired by the neural activity patterns of the brain, HD computing performs cognitive tasks by exploiting long vectors, namely hypervectors, rather than working with scalar numbers as in conventional computing. Since a hypervector has thousands of dimensions (elements), the majority of prior work assumes binary elements to simplify the computation and reduce the processing cost. In this paper, we first demonstrate that the dimensions need more than one bit to provide the accuracy required to make HD computing applicable to real-world cognitive tasks. Increasing the bit-width, however, sacrifices energy efficiency and performance, even when using low-bit integers as the hypervector elements. To address this issue, we propose SparseHD, a framework for HD acceleration that leverages sparsity to improve the efficiency of HD computing. Essentially, SparseHD takes account of the statistical properties of a trained HD model and drops the least effective elements of the model, augmented by iterative retraining to compensate for the quality loss introduced by sparsity. Thanks to the bit-level manipulability and abundant parallelism granted by FPGAs, we also propose a novel FPGA-based accelerator that effectively exploits sparsity in HD computation. We evaluate the efficiency of our framework on practical classification problems. We observe that SparseHD makes the HD model up to 90% sparse with minimal quality loss (less than 1%) compared to the non-sparse baseline model. Our evaluation shows that, on average, SparseHD provides 48.5× lower energy consumption and 15.0× faster execution compared to an AMD R390 GPU implementation.
Title: SparseHD: Algorithm-Hardware Co-optimization for Efficient High-Dimensional Computing. DOI: 10.1109/FCCM.2019.00034. In: FCCM 2019.
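The model-sparsification step described above can be sketched simply: keep only the largest-magnitude elements of each trained class hypervector and zero the rest, shrinking the dot-product work during inference. This toy sketch (tiny dimensions, made-up data, no iterative retraining) illustrates the idea, not the paper's implementation:

```python
# SparseHD-style model sparsification sketch: zero out the least effective
# (smallest-magnitude) dimensions of each class hypervector. Real hypervectors
# have thousands of dimensions; retraining to recover accuracy is omitted.

def sparsify(hv, keep_ratio):
    k = int(len(hv) * keep_ratio)
    # Indices of the k elements with the largest magnitude.
    keep = set(sorted(range(len(hv)),
                      key=lambda i: abs(hv[i]), reverse=True)[:k])
    return [v if i in keep else 0 for i, v in enumerate(hv)]

def classify(query, class_hvs):
    # Similarity = dot product; zeroed dimensions can be skipped in hardware.
    dots = {c: sum(q * v for q, v in zip(query, hv))
            for c, hv in class_hvs.items()}
    return max(dots, key=dots.get)

class_hvs = {
    "a": [5, -1, 0.2, 4, -0.1, 3],
    "b": [-4, 0.3, 5, -2, 0.2, -3],
}
sparse = {c: sparsify(hv, keep_ratio=0.5) for c, hv in class_hvs.items()}
query = [4, 0, 0, 3, 0, 2]
print(classify(query, class_hvs), classify(query, sparse))  # prints: a a
```

The high-magnitude elements dominate the similarity score, which is why aggressive sparsification can leave classification decisions nearly unchanged.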
We choose a random network of Hodgkin–Huxley (HH) neurons with exponential synaptic conductances as a case study for accelerating the simulation of networks of spiking neurons on an FPGA. Focusing on the conductance-based HH (COBAHH) benchmark, we execute the benchmark on a general-purpose simulator for spiking neural networks, identify a computationally intensive kernel in the generated C++ code, convert the kernel into a portable OpenCL kernel, and describe kernel optimizations that reduce resource utilization and improve kernel performance. We evaluate the kernel on an Intel Arria 10 based FPGA platform, an Intel Xeon 16-core CPU, and an NVIDIA Tesla P100 GPU. FPGAs are promising for the simulation of spiking neural networks.
Title: Exploring the Random Network of Hodgkin and Huxley Neurons with Exponential Synaptic Conductances on OpenCL FPGA Platform. Authors: Zheming Jin, H. Finkel. DOI: 10.1109/FCCM.2019.00057. In: FCCM 2019.
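The per-neuron work inside such a simulation kernel is the numerical integration of the HH equations. A minimal sketch of one forward-Euler step for a single neuron with the standard textbook parameters (the COBAHH network, exponential synapses, and the OpenCL data layout from the paper are not modelled):

```python
import math

# One forward-Euler step of the classic Hodgkin–Huxley equations for a single
# neuron. Standard textbook constants; purely illustrative of the arithmetic a
# spiking-network kernel performs per neuron per timestep.

C_M, G_NA, G_K, G_L = 1.0, 120.0, 36.0, 0.3   # uF/cm^2, mS/cm^2
E_NA, E_K, E_L = 50.0, -77.0, -54.4           # mV

def rates(v):
    an = 0.01 * (v + 55) / (1 - math.exp(-(v + 55) / 10))
    bn = 0.125 * math.exp(-(v + 65) / 80)
    am = 0.1 * (v + 40) / (1 - math.exp(-(v + 40) / 10))
    bm = 4.0 * math.exp(-(v + 65) / 18)
    ah = 0.07 * math.exp(-(v + 65) / 20)
    bh = 1.0 / (1 + math.exp(-(v + 35) / 10))
    return an, bn, am, bm, ah, bh

def hh_step(v, n, m, h, i_ext, dt):
    an, bn, am, bm, ah, bh = rates(v)
    i_ion = (G_NA * m**3 * h * (v - E_NA)
             + G_K * n**4 * (v - E_K)
             + G_L * (v - E_L))
    v += dt * (i_ext - i_ion) / C_M
    n += dt * (an * (1 - n) - bn * n)
    m += dt * (am * (1 - m) - bm * m)
    h += dt * (ah * (1 - h) - bh * h)
    return v, n, m, h

v, n, m, h = -65.0, 0.317, 0.053, 0.596   # near the resting state
spiked = False
for _ in range(5000):                      # 50 ms at dt = 0.01 ms
    v, n, m, h = hh_step(v, n, m, h, i_ext=10.0, dt=0.01)
    spiked = spiked or v > 0.0
print("spiked:", spiked)
```

Each neuron's update is independent given its inputs, which is exactly the data parallelism the OpenCL kernel exposes to the FPGA pipeline.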