
Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays: Latest Publications

FPGA-based LSTM Acceleration for Real-Time EEG Signal Processing: (Abstract Only)
Zhe Chen, Andrew G. Howe, H. T. Blair, J. Cong
Closed-loop neurofeedback is a growing area of research and development for novel therapies to treat brain disorders. A neurofeedback device can detect disease symptoms (such as motor tremors or seizures) in real time from electroencephalogram (EEG) signals, and respond by rapidly delivering neurofeedback stimulation that relieves these symptoms. Conventional EEG processing algorithms rely on acausal filters, which impose delays that can exceed the short feedback latency required for closed-loop stimulation. In this paper, we first introduce a method for causal filtering using long short-term memory (LSTM) networks, which radically reduces the filtering latency. We then propose a reconfigurable architecture that supports time-division multiplexing of LSTM inference engines on a prototype neurofeedback device. We implemented a 128-channel EEG signal processing design on a Zynq-7030 device and demonstrated its feasibility. We then scaled the design up to Zynq-7045 and Virtex-690t devices to achieve high-performance, energy-efficient implementations for massively parallel brain signal processing, and evaluated the performance against optimized implementations on a CPU and a GPU at the same CMOS technology node. Experimental results show that the Virtex-690t achieves 1.32x and 11x speed-ups over the K40c GPU and the multi-threaded Xeon E5-2860 CPU, respectively, while the FPGA achieves 6.1x and 26.6x better energy efficiency than the GPU and CPU.
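To make the causal-filtering idea concrete, here is a minimal NumPy sketch (our illustration, not the authors' implementation) that runs an LSTM cell over one EEG channel sample by sample, so the output at time t depends only on samples 0..t. The stacked weight layout and the choice of reading the filtered estimate from one hidden unit are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM time step. W stacks the input/forget/cell/output gate weights
    and acts on the concatenation of the current input x and hidden state h."""
    n = h.size
    z = W @ np.concatenate([x, h]) + b
    i = sigmoid(z[:n])          # input gate
    f = sigmoid(z[n:2*n])       # forget gate
    g = np.tanh(z[2*n:3*n])     # candidate cell update
    o = sigmoid(z[3*n:])        # output gate
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def causal_filter(signal, W, b, n_hidden):
    """Filter a 1-D signal sample by sample: the output at time t depends only
    on samples 0..t, so no acausal look-ahead delay is introduced."""
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    out = np.empty_like(signal)
    for t, x in enumerate(signal):
        h, c = lstm_step(np.array([x]), h, c, W, b)
        out[t] = h[0]  # read the filtered estimate from one hidden unit (an assumption)
    return out
```

Because no future samples are consumed, the per-sample latency is a single cell evaluation, which is the step the hardware pipelines and time-division multiplexes across channels.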
DOI: 10.1145/3174243.3174969 (published 2018-02-15)
Citations: 4
Mapping Large-Scale DNNs on Asymmetric FPGAs: (Abstract Only)
Wentai Zhang, Jiaxi Zhang, Minghua Shen, Nong Xiao, Guojie Luo
FPGAs are very attractive for accelerating deep neural networks (DNNs). While a single FPGA can provide good performance for small-scale DNNs, support for large-scale DNNs is very limited because of their higher resource demands. In this paper, we propose an efficient mapping approach for accelerating large-scale DNNs on an asymmetric multi-FPGA architecture. In contrast to state-of-the-art single-FPGA resource reuse for large-scale DNNs, we adopt a multi-FPGA approach to strive for higher performance. In this setting, the neural network mapping problem can be formulated as a resource allocation problem, and a dynamic programming-based partitioning is designed to solve it optimally. Note that the network topology and communication bandwidth of the multiple FPGAs are used to guide the partitioning to boost performance while satisfying the resource-performance trade-off constraints within a single FPGA. Experimental results on the large-scale ResNet-152 demonstrate that our approach, deployed on sixteen FPGAs, provides a 16.4x GOPS advantage over the state-of-the-art work.
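The resource-allocation view can be sketched as a classic chain-partitioning dynamic program. The simplified model below is an assumption on our part: it only balances per-layer compute cost across devices and ignores the topology and bandwidth guidance the paper uses. It splits a layer chain into contiguous groups, one per FPGA, minimizing the pipeline bottleneck:

```python
def partition_layers(costs, n_fpgas):
    """Split a chain of layer costs into n_fpgas contiguous groups, minimizing
    the largest group cost (the pipeline bottleneck). O(n^2 * k) dynamic program."""
    n = len(costs)
    prefix = [0.0]
    for c in costs:
        prefix.append(prefix[-1] + c)
    INF = float("inf")
    # best[k][i]: minimal bottleneck using k devices for the first i layers
    best = [[INF] * (n + 1) for _ in range(n_fpgas + 1)]
    cut = [[0] * (n + 1) for _ in range(n_fpgas + 1)]
    best[0][0] = 0.0
    for k in range(1, n_fpgas + 1):
        for i in range(1, n + 1):
            for j in range(k - 1, i):  # j = layers handled by the first k-1 devices
                bottleneck = max(best[k - 1][j], prefix[i] - prefix[j])
                if bottleneck < best[k][i]:
                    best[k][i] = bottleneck
                    cut[k][i] = j
    # walk the cut table backwards to recover the group boundaries
    cuts, i = [], n
    for k in range(n_fpgas, 0, -1):
        cuts.append(cut[k][i])
        i = cut[k][i]
    bounds = sorted(cuts) + [n]
    groups = [list(range(bounds[t], bounds[t + 1])) for t in range(n_fpgas)]
    return best[n_fpgas][n], groups
```

For example, layer costs [4, 3, 2, 6, 5] on two devices yield the split [4, 3, 2] | [6, 5] with bottleneck 11; the paper's formulation would additionally weigh inter-FPGA link bandwidth when scoring each cut.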
DOI: 10.1145/3174243.3174982 (published 2018-02-15)
Citations: 0
Session details: Session 3: Deep Learning
P. Cheung
DOI: 10.1145/3252938 (published 2018-02-15)
Citations: 0
Session details: Session 4: High Level Synthesis 1
S. Neuendorffer
DOI: 10.1145/3252939 (published 2018-02-15)
Citations: 0
Session details: Session 5: Applications 1
J. Lockwood
DOI: 10.1145/3252940 (published 2018-02-15)
Citations: 0
Liquid Silicon: A Data-Centric Reconfigurable Architecture Enabled by RRAM Technology
Yue Zha, J. Li
This paper presents a data-centric reconfigurable architecture, namely Liquid Silicon, enabled by emerging non-volatile memory, i.e., RRAM. Compared to the heterogeneous architecture of commercial FPGAs, Liquid Silicon is inherently a homogeneous architecture comprising a two-dimensional (2D) array of identical 'tiles'. Each tile can be configured into one or a combination of four modes: TCAM, logic, interconnect, and memory. Such flexibility allows users to partition resources based on applications' needs, in contrast to the fixed hardware design using dedicated hard IP blocks in FPGAs. In addition to better resource usage, its 'memory friendly' architecture effectively addresses a limitation of commercial FPGAs, i.e., scarce on-chip memory resources, making it an effective complement to FPGAs. Moreover, its coarse-grained logic implementation results in shallower logic depth, less inter-tile routing overhead, and thus smaller area and better performance than its FPGA counterpart. Our study shows that, on average, for both traditional and emerging applications, we achieve a 62% area reduction, a 27% speedup, and a 31% improvement in energy efficiency when mapping applications onto Liquid Silicon instead of FPGAs.
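As a toy illustration of the homogeneous tile array, the sketch below models a rows x cols grid whose identical tiles are each assigned one of the four modes according to requested resource fractions. The `Mode` enum and the allocation policy are hypothetical; the paper leaves the actual partitioning to the user and mapping tool:

```python
from enum import Enum

class Mode(Enum):
    TCAM = 0
    LOGIC = 1
    INTERCONNECT = 2
    MEMORY = 3

def configure(rows, cols, fractions):
    """Assign a mode to every tile of a homogeneous rows x cols array according
    to the requested resource fractions (hypothetical allocation policy)."""
    total = rows * cols
    tiles = []
    for mode, frac in fractions.items():
        tiles += [mode] * round(frac * total)
    tiles = tiles[:total]                          # guard against rounding overshoot
    tiles += [Mode.LOGIC] * (total - len(tiles))   # pad any rounding gap with logic
    return [tiles[r * cols:(r + 1) * cols] for r in range(rows)]
```

The point of the model is that the same fabric can be, say, half memory for a data-heavy workload and half TCAM for a lookup-heavy one, whereas an FPGA's hard block ratios are fixed at fabrication.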
DOI: 10.1145/3174243.3174244 (published 2018-02-15)
Citations: 7
A Self-adaptation Method of Fitting Convolutional Neural Network into FPGA: (Abstract Only)
Ning Mao, Zhihong Huang, Xing Wei, He Zhao, Xinkai Di, Le Yu, Haigang Yang
In recent years, Convolutional Neural Networks (CNNs) have been widely used in many artificial intelligence (AI) related fields. Of the many implementation platforms for CNNs, the FPGA is regarded as an optimal platform because of its high power efficiency and flexibility. Although various FPGA accelerators have been proposed to realize CNNs, some of them are implemented through High-Level Synthesis, such as in OpenCL, which may result in inefficient operation performance and resource utilization. Therefore, we propose to parameterize the RTL design at both the algorithm and hardware implementation levels. Four types of parallelism are considered to model the parameterized design, in terms of the input feature map, the output feature map, the layer, and the convolution kernel. Meanwhile, a library covering the convolution layer, fully-connected layer, pooling layer, and control module is established to cater for various CNN models. Further, an algorithm is proposed to find an optimal level of parallelism under limited resources. As a case study, four typical CNNs are implemented on a Stratix III EP3SL110, taking up on-chip memory. Compared with some existing works using an automated design flow, the implementations obtained by the proposed approach achieve up to 17.13× GOPS. By our best estimate, our design also achieves 1.33× resource efficiency and 3.61× power efficiency.
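The parallelism search can be illustrated with a brute-force stand-in for the paper's algorithm: enumerate unroll factors along the input-map, output-map, and kernel dimensions and keep the largest parallel MAC count that still fits a DSP budget. The cost model (one DSP per MAC) and the function name are our assumptions:

```python
import itertools

def best_unroll(dsp_budget, in_maps, out_maps, k_size, dsp_per_mac=1):
    """Enumerate unroll factors (ti, to, tk) over the input-map, output-map and
    kernel dimensions; return the combination with the most parallel MACs that
    fits the DSP budget. A brute-force stand-in for the paper's search."""
    best = (0, None)
    for ti, to, tk in itertools.product(range(1, in_maps + 1),
                                        range(1, out_maps + 1),
                                        range(1, k_size * k_size + 1)):
        macs = ti * to * tk  # MAC units instantiated in parallel
        if macs * dsp_per_mac <= dsp_budget and macs > best[0]:
            best = (macs, (ti, to, tk))
    return best
```

A real self-adaptation flow would also score BRAM and logic usage per candidate and sweep every layer of the network, but the budget-constrained maximization above is the core of the idea.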
DOI: 10.1145/3174243.3175003 (published 2018-02-15)
Citations: 0
Session details: Session 6: High Level Synthesis 2
G. Constantinides
DOI: 10.1145/3252941 (published 2018-02-15)
Citations: 0
Scalable Window Generation for the Intel Broadwell+Arria 10 and High-Bandwidth FPGA Systems
G. Stitt, Abhay Gupta, Madison N. Emas, David Wilson, A. Baylis
Emerging FPGA systems are providing higher external memory bandwidth to compete with GPU performance. However, because FPGAs often achieve parallelism through deep pipelines, traditional FPGA design strategies do not necessarily scale well to large amounts of replicated pipelines that can take advantage of higher bandwidth. We show that sliding-window applications, an important subset of digital signal processing, demonstrate this scalability problem. We introduce a window generator architecture that enables replication to over 330 GB/s, which is an 8.7x improvement over previous work. We evaluate the window generator on the Intel Broadwell+Arria10 system for 2D convolution and show that for traditional convolution (one filter per image), our approach outperforms a 12-core Xeon Broadwell E5 by 81x and a high-end Nvidia P6000 GPU by an order of magnitude for most input sizes, while improving energy by 15.7x. For convolutional neural nets (CNNs), we show that although the GPU and Xeon typically outperform existing FPGA systems, projected performances of the window generator running on FPGAs with sufficient bandwidth can outperform high-end GPUs for many common CNN parameters.
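A software model of window generation helps clarify the claim: after an initial fill, a line-buffer-based generator emits one k x k window per input pixel, and replicating such pipelines is what lets throughput track the available bandwidth. The sketch below (our illustration, using plain array slicing in place of actual line buffers) streams windows and consumes them in a 2D convolution:

```python
import numpy as np

def window_stream(image, k):
    """Software model of a window generator: yield every k x k window of the
    image in raster order, one per output position (the FPGA analogue sustains
    one window per clock after its line buffers fill)."""
    rows, cols = image.shape
    for r in range(rows - k + 1):
        for c in range(cols - k + 1):
            yield image[r:r + k, c:c + k]

def conv2d(image, kernel):
    """Valid-mode 2D convolution (no kernel flip) driven by the window stream."""
    k = kernel.shape[0]
    out_shape = (image.shape[0] - k + 1, image.shape[1] - k + 1)
    sums = (np.sum(w * kernel) for w in window_stream(image, k))
    return np.fromiter(sums, dtype=float).reshape(out_shape)
```

In hardware, many copies of the multiply-accumulate datapath would each consume windows in parallel; the paper's contribution is a generator that keeps all of those copies fed at over 330 GB/s.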
DOI: 10.1145/3174243.3174262 (published 2018-02-15)
Citations: 14
Evaluation of OpenCL Performance-oriented Optimizations for Streaming Kernels on the FPGA: (Abstract Only)
Zheming Jin, H. Finkel
FPGAs can execute streaming applications efficiently, and high-level synthesis (HLS) tools allow people without complex hardware design knowledge to evaluate an application on FPGAs, so there is an opportunity and a need to understand what roles OpenCL and FPGAs can play in the streaming domain. To this end, we evaluate the overhead of the OpenCL infrastructure on the Nallatech 385A FPGA board, which features an Arria 10 GX1150 FPGA. We then explore the implementation space and discuss performance optimization techniques for streaming kernels using the OpenCL-to-FPGA HLS tool. On the target platform, the infrastructure overhead requires 12% of the FPGA memory and logic resources. The latency of a single work-item kernel execution is 11 µs, and the maximum frequency of a kernel implementation is around 300 MHz. The experimental results for the streaming kernels show that FPGA resources, such as block RAMs and DSPs, can limit kernel performance before the memory bandwidth constraint takes effect. Kernel vectorization and compute unit duplication are practical optimization techniques that can improve kernel performance by a factor of 2 to 10, and the combination of the two achieves the best performance.
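The interplay between vectorization, compute-unit duplication, and memory bandwidth can be summarized with a back-of-the-envelope roofline estimate (our model, not from the paper; all parameter names and values are illustrative):

```python
def throughput(vec, num_cu, peak_mem_gbs, bytes_per_item, f_mhz):
    """Roofline-style estimate of sustained items/s: vectorization (vec) and
    compute-unit duplication (num_cu) multiply the datapath rate until the
    memory-bandwidth ceiling binds, mirroring how the measured 2x-10x gains
    eventually flatten."""
    compute_rate = vec * num_cu * f_mhz * 1e6          # items/s the datapath can produce
    memory_rate = peak_mem_gbs * 1e9 / bytes_per_item  # items/s the memory can feed
    return min(compute_rate, memory_rate)
```

For example, at 300 MHz with 32 GB/s of bandwidth and 4-byte items, 4-wide vectorization across 2 compute units is still compute-bound, while 16-wide vectorization across 4 units hits the 8 G items/s memory ceiling; this matches the abstract's observation that on-chip resources, then bandwidth, cap the achievable speedup.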
DOI: 10.1145/3174243.3174967 (published 2018-02-15)
Citations: 1