
Latest publications from the 2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig)

Enabling dynamic and partial reconfiguration in Xilinx SDSoC
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857168
Tobias Kalb, D. Göhringer
In recent years, dynamic partial reconfiguration (DPR) has become an established technique for systems featuring a field-programmable gate array (FPGA). Systems-on-Chip (SoCs) with an ARM processor ease the use of DPR and motivate its adoption for its clear advantages, such as reduced area and power and faster FPGA reconfiguration. Nonetheless, the development process for SoCs is still a complex and time-consuming task, especially for designs using DPR. Xilinx counters this complexity with its new high-level tools, namely the SDx Development Environment. The SDSoC Development Environment accelerates the development of designs running on Zynq-7000 devices by using only C/C++ applications as input. Unfortunately, this high-level workflow does not incorporate DPR. This paper presents an approach for using DPR in Xilinx SDSoC, so that an application-specific design can benefit from both the high-level workflow and the advantages of DPR. We show that our approach to DPR in SDSoC shortens the overall design time and yields a more efficient embedded application. In our use case, the dynamic and partial reconfiguration of hardware accelerators takes 10 ms, and the hardware-related section of our embedded application is accelerated by a factor of 14 due to DPR.
Citations: 11
Data-rate-aware FPGA-based acceleration framework for streaming applications
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857162
Siavash Rezaei, César-Alejandro Hernández-Calderón, S. Mirzamohammadi, E. Bozorgzadeh, A. Veidenbaum, A. Nicolau, M. Prather
In heterogeneous architectures, FPGAs are expected not only to provide higher performance, but also to offer a more energy-efficient solution for computationally intensive tasks. While parallelism and pipelining enhance performance on FPGA platforms, the data transfer rate from and to off-chip memory can cause performance degradation. We propose an automated high-level synthesis framework for FPGA-based acceleration of nested loops over large multidimensional input data sets. Given the high level of parallelism in such applications, our proposed data prefetching algorithm determines the data rate for each parallel datapath. Empirical results on a case study in scientific computing show that FPGA mapping of such nested loops accelerates the application compared to traditional mapping on multicores. The FPGA-accelerated computation achieves a 3x speedup in runtime and 27x energy-delay-product savings compared to multicore computation.
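The abstract does not detail the prefetching algorithm itself. As a purely hypothetical illustration of what "data-rate-aware" allocation could mean (the function `allocate_rates` and the proportional-scaling policy are invented for this sketch, not taken from the paper):

```python
def allocate_rates(demands, total_bw):
    """Hypothetical rate-aware allocation: if the sum of per-datapath
    bandwidth demands exceeds the off-chip memory bandwidth, scale every
    datapath down proportionally, so the pipelines slow down uniformly
    instead of one of them starving."""
    need = sum(demands)
    scale = min(1.0, total_bw / need)   # 1.0 means no contention
    return [d * scale for d in demands]

# Three parallel datapaths demanding 500 units/s total, but only 250 available:
rates = allocate_rates([100, 300, 100], total_bw=250)
```

With those numbers every datapath runs at half its demanded rate, and the allocation exactly saturates the available bandwidth.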
Citations: 8
An FPGA implementation of a long short-term memory neural network
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857151
J. Ferreira, Jose Fonseca
Our work proposes a hardware architecture for a Long Short-Term Memory (LSTM) neural network, aiming to outperform software implementations by exploiting its inherent parallelism. The main design decisions are presented, along with the proposed network architecture and a description of the network's main building blocks. The network is synthesized for various sizes and platforms, and the performance results are presented and analyzed. Our synthesized network achieves a 251x speed-up over a custom-built software network running on an i7-3770K desktop computer, demonstrating the benefits of parallel computation for this kind of network.
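As a software reference for the datapath such a design must implement, here is a minimal single-unit LSTM step in plain Python. This is a scalar toy with illustrative weights, not the authors' architecture; it only shows the gate equations the parallel hardware evaluates:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step with scalar input and state (toy sizes).
    W, U, b are dicts keyed by gate name: 'i', 'f', 'o', 'g'."""
    i = sigmoid(W['i'] * x + U['i'] * h_prev + b['i'])    # input gate
    f = sigmoid(W['f'] * x + U['f'] * h_prev + b['f'])    # forget gate
    o = sigmoid(W['o'] * x + U['o'] * h_prev + b['o'])    # output gate
    g = math.tanh(W['g'] * x + U['g'] * h_prev + b['g'])  # candidate cell value
    c = f * c_prev + i * g                                # new cell state
    h = o * math.tanh(c)                                  # new hidden state
    return h, c

# Run a short sequence with fixed, arbitrary weights.
W = {k: 0.5 for k in 'ifog'}
U = {k: 0.25 for k in 'ifog'}
b = {k: 0.0 for k in 'ifog'}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, W, U, b)
```

The four gate computations are independent of one another, which is precisely the parallelism a hardware implementation can exploit within each time step.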
Citations: 47
FPGA architecture for feed-forward sequential memory network targeting long-term time-series forecasting
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857169
Kentaro Orimo, Kota Ando, Kodai Ueyoshi, M. Ikebe, T. Asai, M. Motomura
Deep learning is widely used in various applications, and diverse neural networks have been proposed. One such network, the novel feed-forward sequential memory network (FSMN), aims to forecast future data by extracting time-series features. The FSMN is a standard feed-forward neural network equipped with time-domain filters, which allows it to forecast without recurrent feedback. In this paper, we propose a field-programmable gate array (FPGA) architecture for this model and show that its resource usage does not grow exponentially as the network scale increases.
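The key point above, forecasting with time-domain filters instead of recurrent feedback, can be modelled in a few lines: the FSMN memory block is a finite impulse response (FIR) filter over past hidden activations. A scalar toy version (filter coefficients here are illustrative):

```python
def fsmn_memory(hidden_seq, filter_coeffs):
    """FSMN-style time-domain FIR filter over a sequence of scalar
    hidden activations: m[t] = sum_i a[i] * h[t - i]. There is no
    feedback path, so the computation stays purely feed-forward."""
    out = []
    for t in range(len(hidden_seq)):
        m = 0.0
        for i, a in enumerate(filter_coeffs):
            if t - i >= 0:          # taps before the sequence start contribute 0
                m += a * hidden_seq[t - i]
        out.append(m)
    return out

mem = fsmn_memory([1.0, 2.0, 3.0, 4.0], [0.5, 0.25, 0.125])
```

Because each output depends only on a fixed window of past values, the hardware needs a shift register and multiply-accumulate units rather than the state feedback of a recurrent network.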
Citations: 4
FPGA debugging by a device start and stop approach
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857170
Habib ul Hasan Khan, D. Göhringer
This paper presents an FPGA debugging methodology based on a device start and stop (DSAS) approach. Using this approach, the design starts and stops a device under test (DUT) and saves the data to external memory without human interaction. The presented debugging circuit records data in a trace buffer; once the trace buffer is full, it stops the DUT, saves the data to external memory over Ethernet, and then restarts the DUT. Hence the quantity of debug data is not limited. The contents stored on the external devices can subsequently be viewed with open-source waveform viewers or HDL simulators. The main benefits of the technique are an unlimited debug window, lower use of scarce FPGA resources, and no loss of debugging data. Neither an external emulation system nor user intervention is required to save the recorded data once the BRAMs are full.
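The stop/flush/restart cycle described above can be sketched as a small software model. The class and attribute names are invented for illustration, and the Ethernet transport is abstracted to a plain list; the point is only that no sample is ever dropped:

```python
class DsasDebugger:
    """Software model of the device start and stop (DSAS) idea: fill a
    fixed-depth trace buffer, stop the DUT, flush the buffer to
    'external memory', then restart the DUT."""
    def __init__(self, depth):
        self.depth = depth
        self.buf = []               # stand-in for the BRAM trace buffer
        self.external_memory = []   # stand-in for DDR reached over Ethernet
        self.stops = 0              # how many times the DUT was halted

    def capture(self, sample):
        self.buf.append(sample)
        if len(self.buf) == self.depth:        # buffer full: stop the DUT
            self.stops += 1
            self.external_memory.extend(self.buf)  # flush over 'Ethernet'
            self.buf.clear()                       # restart the DUT

dbg = DsasDebugger(depth=4)
for s in range(10):
    dbg.capture(s)
```

After ten samples with a depth-4 buffer, the DUT has been stopped twice, eight samples sit in external memory, and the last two are still buffered; nothing was overwritten or lost.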
Citations: 12
Solving large systems of linear equations over GF(2) on FPGAs
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857188
Wen Wang, Jakub Szefer, R. Niederhagen
This paper presents an efficient systolic line architecture for solving large systems of linear equations using Gaussian elimination on the coefficient matrix. Our architecture can also be used for solving matrix inversion problems and for computing the systematic form of matrices. These are common and important computational problems that appear in areas such as cryptography and cryptanalysis. Our architecture solves these problems efficiently for any large-sized matrix over GF(2), regardless of matrix size, shape or density. We implemented and synthesized our design for Altera and Xilinx FPGAs to obtain evaluation data. The results show sub-μs performance for the Gaussian elimination of medium-sized matrices and performance on the order of tens to hundreds of ms for large matrices. In addition, this is one of the first works addressing large-sized matrices of up to 4,000 × 8,000 elements and therefore is suitable for post-quantum cryptographic schemes that require handling such large matrices.
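The core operation the systolic architecture implements, Gaussian elimination over GF(2), reduces to row swaps and XORs, since addition and subtraction in GF(2) are both XOR. A compact functional sketch (not the paper's systolic design) using Python integers as bit-packed rows:

```python
def gf2_row_reduce(rows, ncols):
    """Gaussian elimination over GF(2). Each row is an int whose bits
    are the matrix entries (bit k = column k); XOR is row addition.
    Reduces `rows` in place and returns the rank."""
    rank = 0
    for col in range(ncols):
        # find a pivot row with a 1 in this column
        pivot = next((r for r in range(rank, len(rows))
                      if rows[r] >> col & 1), None)
        if pivot is None:
            continue                      # column is already all-zero below
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(len(rows)):        # clear this column in every other row
            if r != rank and rows[r] >> col & 1:
                rows[r] ^= rows[rank]
        rank += 1
    return rank

# 3x3 example; the third row is the XOR of the first two, so rank is 2.
m = [0b011, 0b110, 0b101]
rank = gf2_row_reduce(m, 3)
```

Bit-packing a whole row into one word is also what makes the operation hardware-friendly: one row update is a single wide XOR, which a systolic array performs on many rows per cycle.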
Citations: 14
An FPGA-optimized architecture of anti-aliasing based super resolution for real-time HDTV to 4K- and 8K-UHD conversions
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857153
Hotaka Kusano, M. Ikebe, T. Asai, M. Motomura
The demand for lightweight and high-speed super resolution (SR) techniques is growing because super-high-resolution displays, such as 4K/8K ultra-high-definition televisions (UHDTVs), have become common. We propose an SR method based on over-up-sampling and anti-aliasing that, unlike conventional SR methods, requires no iterative processing. Our method attenuates jaggies at the edges of an enlarged image and does not need to preserve the entire enlarged image. The method is therefore suitable for hardware implementation, and the architecture requires only five line buffers in the memory section. We implemented the proposed method on a field-programmable gate array (FPGA) and demonstrated HDTV-to-4K and HDTV-to-8K SR processing in real time (60 frames per second).
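A one-dimensional toy analogue of the pipeline described above: nearest-neighbour over-up-sampling followed by a small low-pass filter to attenuate the stair-step jaggies the upsampling introduces. The paper's design is two-dimensional and uses five line buffers; the [1, 2, 1]/4 kernel here is purely illustrative:

```python
def upsample_antialias(signal, factor=2):
    """1-D sketch: up-sample by sample repetition, then apply a 3-tap
    low-pass filter ([1, 2, 1] / 4) to smooth the repetition edges.
    Edge samples are clamped (replicated) at the boundaries."""
    up = [s for s in signal for _ in range(factor)]  # nearest-neighbour upsample
    out = []
    for i in range(len(up)):
        left = up[max(i - 1, 0)]
        right = up[min(i + 1, len(up) - 1)]
        out.append((left + 2 * up[i] + right) / 4)   # smoothing tap
    return out

smoothed = upsample_antialias([0, 4, 0])
```

Note that each output sample depends only on a fixed neighbourhood of input samples, which is why the hardware version needs only a handful of line buffers rather than a full frame store.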
Citations: 6
Design and implementation of a constant-time FPGA accelerator for fast elliptic curve cryptography
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857163
Atil U. Ay, Erdinç Öztürk, F. Rodríguez-Henríquez, E. Savaş
In this paper we present a scalar multiplication hardware architecture that computes a constant-time variable-base point multiplication over the Galbraith-Lin-Scott (GLS) family of binary elliptic curves. Our hardware design is especially tailored for the quadratic extension field F22n, with n = 127, which allows us to attain a security level close to 128 bits. We explore extensively the usage of digit-based and Karatsuba multipliers for performing the quadratic field arithmetic associated to GLS elliptic curves and report the area and time performance obtained by these two types of multipliers. Targeting a XILINX KINTEX-7 FPGA device, we report a hardware implementation of our design that achieves a delay of just 3.98μs for computing one scalar multiplication. This allows us to claim the current speed record for this operation at or around the 128-bit security level for any hardware or software realization reported in the literature.
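The quadratic-field arithmetic mentioned above is built on binary-field multiplication: carry-less polynomial multiplication followed by reduction modulo an irreducible polynomial. A software model of that primitive, demonstrated in the small field GF(2^8) with the AES polynomial rather than the paper's F2^127 (the shift-and-XOR loop mirrors what a digit-serial multiplier computes one digit at a time):

```python
def gf2m_mul(a, b, mod_poly, m):
    """Multiply two elements of GF(2^m) represented as integers
    (bit k = coefficient of x^k), reducing by the irreducible mod_poly."""
    # carry-less (polynomial) multiplication: shift-and-XOR
    prod = 0
    while b:
        if b & 1:
            prod ^= a
        a <<= 1
        b >>= 1
    # reduce modulo the irreducible polynomial, high bits first
    for bit in range(2 * m - 2, m - 1, -1):
        if prod >> bit & 1:
            prod ^= mod_poly << (bit - m)
    return prod

# Classic FIPS-197 example in GF(2^8) with x^8 + x^4 + x^3 + x + 1 (0x11B):
# {57} * {83} = {C1}
p = gf2m_mul(0x57, 0x83, 0x11B, 8)
```

Hardware multipliers trade off exactly these two stages: a digit-based design serializes the shift-and-XOR loop, while a Karatsuba design splits the operands to reduce the number of partial products.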
Citations: 6
Efficient deep neural network acceleration through FPGA-based batch processing
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857167
Thorbjörn Posewsky, Daniel Ziener
Deep neural networks are an extremely successful and widely used technique for various pattern recognition and machine learning tasks. Due to power and resource constraints, these computationally intensive networks are difficult to implement in embedded systems, yet the number of applications that could benefit from them is rising rapidly. In this paper, we propose a novel architecture for processing previously learned, arbitrary deep neural networks on FPGA-based SoCs that overcomes these limitations. A key contribution of our approach, which we refer to as batch processing, reduces the required weight-matrix transfers from external memory by reusing weights across multiple input samples. Combined with sophisticated pipelining and the use of high-performance interfaces, this technique accelerates data processing by one order of magnitude compared to existing approaches on the same FPGA device. Furthermore, we achieve data throughput comparable to a fully featured x86-based system at only a fraction of its energy consumption.
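The batch-processing idea, fetching a layer's weight matrix once and reusing it for every sample in the batch, can be captured as a work schedule. This is a sketch of the scheduling order only (names and the layer/batch sizes are illustrative, not the paper's implementation):

```python
def batched_schedule(num_layers, batch):
    """Yield work items in batch-processing order: each layer's weights
    are fetched from external memory once, then reused for every sample
    in the batch. A sample-at-a-time schedule would instead need
    num_layers * batch weight fetches."""
    for layer in range(num_layers):
        yield ('fetch_weights', layer)           # one transfer per layer
        for sample in range(batch):
            yield ('compute', layer, sample)     # weights reused per sample

ops = list(batched_schedule(num_layers=2, batch=3))
weight_fetches = sum(1 for op in ops if op[0] == 'fetch_weights')
```

With 2 layers and a batch of 3, the schedule performs 2 weight fetches instead of the 6 a naive per-sample ordering would require; the saving grows linearly with the batch size.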
Citations: 7
1 Tb/s anti-replay protection with 20-port on-chip RAM memory in FPGAs
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857190
B. Buhrow, William J. Goetzinger, B. Gilbert
As network data rates advance toward 1 Tb/s, hardware-based implementations of anti-replay offer desirable tradeoffs over software. However, internal logic busses in FPGAs are becoming wider (512+ bits) and segmented (more than one packet per clock cycle) to accommodate increased network data rates. Such busses are challenging for applications such as anti-replay that require read-modify-write operations to a coherent database on each packet arrival. In this paper we present an FPGA-targeted pipelined anti-replay design capable of accommodating 1024 IPsec tunnels at 1 Tb/s data rate. The novel design is enabled by fast on-chip block RAMs in a xcvu190 Virtex Ultrascale FPGA that are used to construct a 20-port RAM memory operating at 400 MHz with over 5 Tb/s of peak bandwidth. Custom single-clock write-combining techniques are described that accommodate multiple concurrent updates to the same database address. We also investigate the limits of capacity and concurrency for the anti-replay application.
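The per-tunnel state update behind such a design is the classic sliding-window anti-replay check (cf. RFC 4303). A single-tunnel software model with an integer bitmap is shown below; the window size and method names are illustrative, and the paper's actual contribution, keeping 1024 such states coherent at line rate on multi-ported BRAM, is not captured by this sketch:

```python
class AntiReplayWindow:
    """Sliding-window anti-replay check for one tunnel. A bitmap of
    `window` bits tracks which recent sequence numbers were accepted:
    bit i set means sequence number (top - i) has been seen."""
    def __init__(self, window=64):
        self.window = window
        self.top = 0        # highest sequence number accepted so far
        self.bitmap = 0

    def check_and_update(self, seq):
        if seq > self.top:                       # packet advances the window
            shift = seq - self.top
            self.bitmap = ((self.bitmap << shift) | 1) & ((1 << self.window) - 1)
            self.top = seq
            return True
        offset = self.top - seq
        if offset >= self.window:                # older than the window: drop
            return False
        if self.bitmap >> offset & 1:            # already seen: replay, drop
            return False
        self.bitmap |= 1 << offset               # accept and mark as seen
        return True

w = AntiReplayWindow(window=8)
results = [w.check_and_update(s) for s in [5, 3, 5, 2, 100, 3]]
```

Each check is a read-modify-write of (top, bitmap), which is exactly the coherence problem a wide, segmented bus creates when several packets of the same tunnel arrive in one clock cycle.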
Citations: 1