
2016 International Conference on ReConFigurable Computing and FPGAs (ReConFig): Latest Papers

A high-efficiency runtime reconfigurable IP for CNN acceleration on a mid-range all-programmable SoC
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857144
P. Meloni, Gianfranco Deriu, Francesco Conti, Igor Loi, L. Raffo, L. Benini
Convolutional Neural Networks (CNNs) are a nature-inspired model, extensively employed in a broad range of applications in computer vision, machine learning and pattern recognition. The CNN algorithm requires the execution of multiple layers, commonly called convolution layers, that apply 2D convolution filters of different sizes over a set of input image features. Such a computation kernel is intrinsically parallel and thus benefits significantly from acceleration on parallel hardware. In this work, we propose an accelerator architecture, suitable for implementation on mid- to high-range FPGA devices, that can be reconfigured at runtime to adapt to different filter sizes in different convolution layers. We present an accelerator configuration, mapped on a Xilinx Zynq XC-Z7045 device, that achieves up to 120 GMAC/s (16-bit precision) when executing 5×5 filters and up to 129 GMAC/s when executing 3×3 filters, while consuming less than 10 W of power, reaching more than 97% DSP resource utilization at a 150 MHz operating frequency and requiring only 16 B/cycle of I/O bandwidth.
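As a rough sanity check on the reported figures, the sketch below converts the throughput numbers into MACs per clock cycle and compares them with the DSP budget. It assumes one 16-bit MAC per DSP slice per cycle and a device total of 900 DSP slices; neither assumption is stated in the abstract, and "more than 97%" is treated as exactly 97%.

```python
# Back-of-the-envelope check of the reported throughput figures.
# Assumptions (not from the abstract): 900 DSP slices on the device,
# and one multiply-accumulate (MAC) per DSP slice per clock cycle.
CLOCK_HZ = 150e6
TOTAL_DSP = 900           # assumed device total
DSP_UTILIZATION = 0.97    # abstract: "more than 97%"


def macs_per_cycle(gmacs: float, clock_hz: float = CLOCK_HZ) -> float:
    """Convert a GMAC/s figure into MACs completed per clock cycle."""
    return gmacs * 1e9 / clock_hz


used_dsp = TOTAL_DSP * DSP_UTILIZATION
for label, gmacs in (("5x5", 120.0), ("3x3", 129.0)):
    per_cycle = macs_per_cycle(gmacs)
    print(f"{label}: {per_cycle:.0f} MAC/cycle, "
          f"{per_cycle / used_dsp:.0%} of the assumed DSP budget")
```

Under these assumptions, both reported throughputs fit within the number of DSP slices in use, which is consistent with the high utilization the abstract claims.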
Citations: 20
A novel and efficient method to initialize FPGA embedded memory content in asymptotically constant time
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857146
Matěj Bartík, S. Ubik, P. Kubalík
This paper describes the analysis and implementation of a new method for maintaining valid content in FPGA memory blocks with an asymptotically constant-time synchronous clear capability, which is useful for (re)initialization to a single default value. One particular application is high-speed real-time LZ77 [1] lossless compression, where a dictionary has to be (re)initialized before each run of the compression algorithm. The method builds on the two most widely used techniques for clearing memory content: a linear pass over the memory that clears each cell by writing a default value, and a register field that provides a valid/invalid bit for each memory cell. Our solution combines these two techniques with FPGA distributed memory blocks implemented in LUTs (Look-Up Tables) to overcome the drawbacks of each previous method without losing most of their advantages. It strikes a balance between the two previous techniques and exceeds them in speed, resource utilization and (re)initialization latency.
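The two baseline techniques named in the abstract can be modelled behaviourally as follows. This is a software sketch of the baselines only, not of the paper's combined LUT-based design; the class names and the "cycles" returned by `clear` are illustrative assumptions.

```python
class LinearClearMemory:
    """Baseline 1: clear by writing the default value to every cell."""

    def __init__(self, depth, default=0):
        self.default = default
        self.cells = [default] * depth

    def write(self, addr, value):
        self.cells[addr] = value

    def read(self, addr):
        return self.cells[addr]

    def clear(self):
        # One write per cell, so the cost grows linearly with the depth.
        for addr in range(len(self.cells)):
            self.cells[addr] = self.default
        return len(self.cells)  # modelled clear cost in cycles


class ValidBitMemory:
    """Baseline 2: one valid bit per cell, held in a register field."""

    def __init__(self, depth, default=0):
        self.default = default
        self.cells = [default] * depth
        self.valid = 0  # bit vector; bit i set means cell i holds valid data

    def write(self, addr, value):
        self.cells[addr] = value
        self.valid |= 1 << addr

    def read(self, addr):
        # Invalid cells read as the default value, whatever they contain.
        if (self.valid >> addr) & 1:
            return self.cells[addr]
        return self.default

    def clear(self):
        # Resetting the whole register field takes a single cycle.
        self.valid = 0
        return 1  # modelled clear cost in cycles


mem = ValidBitMemory(depth=8)
mem.write(3, 42)
print(mem.read(3), mem.clear(), mem.read(3))
```

The trade-off the abstract alludes to is visible here: the valid-bit variant clears in one step but needs one register bit per cell, while the linear pass needs no extra storage but takes as many steps as there are cells.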
Citations: 3
An FPGA-based design for joint control and monitoring of permanent magnet synchronous motors
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857152
Paul Rogers, R. Kavasseri, Scott C. Smith
We present an FPGA-based design approach that allows motor control and on-board condition monitoring to be achieved in parallel using the same set of physical variables. The key idea lies in exploiting the parallelism of FPGAs to achieve the joint objectives of control and monitoring using common physical variables, namely the motor phase currents. For illustration, a permanent magnet synchronous machine (PMSM) governed by Field Oriented Control (FOC), with health monitoring based on Motor Current Signature Analysis (MCSA), is considered. Since FOC is computationally intensive, the control algorithm is optimized for speed using distributed pipelining, and an FFT core is used for MCSA. The design stage uses MATLAB/Simulink for the reference control, coupled with HDL Coder for VHDL generation. Synthesis and timing analysis are done with Altera's Quartus II. The tool chain allows easy analysis and optimization of model-based motor control algorithms. The results show that the joint objectives can be achieved with decreased current and torque ripple at the expense of modest increases in resource utilization.
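Field Oriented Control conventionally maps the measured phase currents into a rotating d-q frame using the Clarke and Park transforms. The abstract does not describe its implementation, so the following is only a minimal floating-point sketch of the textbook transforms, assuming a balanced three-phase system (i_a + i_b + i_c = 0) and the amplitude-invariant Clarke form; the function names are illustrative.

```python
import math


def clarke(i_a, i_b):
    """Amplitude-invariant Clarke transform, assuming i_a + i_b + i_c = 0."""
    i_alpha = i_a
    i_beta = (i_a + 2.0 * i_b) / math.sqrt(3.0)
    return i_alpha, i_beta


def park(i_alpha, i_beta, theta):
    """Rotate the stationary alpha-beta frame by the rotor angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    i_d = i_alpha * c + i_beta * s
    i_q = -i_alpha * s + i_beta * c
    return i_d, i_q


# Balanced unit-amplitude currents aligned with the rotor angle.
theta = 0.7
i_a = math.cos(theta)
i_b = math.cos(theta - 2.0 * math.pi / 3.0)
i_d, i_q = park(*clarke(i_a, i_b), theta)
print(f"i_d = {i_d:.6f}, i_q = {i_q:.6f}")
```

For currents aligned with the rotor angle, the d component carries the full amplitude and the q component vanishes, which is a convenient check on the sign conventions. The same phase-current samples are what a spectrum-based monitor such as MCSA would feed into an FFT.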
Citations: 6
High-level synthesis of a genomic database search engine
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857174
Rasha Karakchi, Jordan A. Bradshaw, J. Bakos
Genomic database search is an I/O-bound problem, so avoiding unnecessary I/O transactions is a key consideration for improving search throughput. Many approximate search tools such as NCBI BLAST perform a database scan for each query, lacking a mechanism to avoid access to portions of the database that offer no potential for a match. In this paper, we present an approach that uses an FPGA-based pattern filter to convert each search query into a set of potential database matches, which reduces the average portion of the database accessed per query. The approach is based on a hardware design for a pattern filter that can achieve a sustained recognition rate of one pattern per cycle. We used Vivado HLS to design the filter. Despite the presence of loop-carried dependencies, the final design meets the maximum possible throughput as constrained by the code's arithmetic intensity and the available memory bandwidth. In this paper, we describe the filter implementation and our code tuning methodology.
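A throughput limit set by arithmetic intensity and memory bandwidth is commonly expressed as a roofline-style bound. The sketch below illustrates that bound in general; whether the authors used exactly this model is an assumption, and all numbers are hypothetical placeholders rather than values from the paper.

```python
def attainable_throughput(peak_ops_per_s, ops_per_byte, bytes_per_s):
    """Roofline-style bound: the lower of the compute and memory limits."""
    memory_limit = ops_per_byte * bytes_per_s
    return min(peak_ops_per_s, memory_limit)


# Hypothetical example values, chosen only for illustration.
PEAK_OPS = 100e9      # operations per second the datapath could sustain
BANDWIDTH = 10e9      # bytes per second available from memory

for intensity in (0.5, 50.0):  # operations performed per byte fetched
    bound = attainable_throughput(PEAK_OPS, intensity, BANDWIDTH)
    kind = "memory-bound" if bound < PEAK_OPS else "compute-bound"
    print(f"intensity {intensity}: {bound:.3g} ops/s ({kind})")
```

A low-intensity kernel such as a streaming pattern filter sits on the memory-bound side, so reaching the bound means keeping the memory interface busy every cycle.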
Citations: 0
The portable open-source IP core and utility library PoC
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857191
Thomas B. Preußer, M. Zabel, Patrick Lehmann, R. Spallek
Standard libraries and frameworks significantly boost productivity and performance because they enable the re-use of optimized solutions for standard tasks. Hardware designs are often unnecessarily complex because a) a rich RTL library of standard solutions is missing and b) designs must often sacrifice portable and readable behavioral descriptions in order to meet timing and area constraints on the targeted device. The PoC Library addresses these issues. First of all, it provides abstracted solutions for standard tasks. These include single- and dual-port memory components as well as higher-level data structures such as FIFOs, stacks and deques built on top of them. The library further comprises cross-clock triggers, arithmetic and algorithmic cores, such as those for wide addition and sorting, as well as communication stack implementations. Each implementation is encapsulated by a stable interface that is independent of the specific target platform. Nonetheless, device-specific optimizations are available through specialized implementations, which are selected internally whenever this is beneficial or necessitated by the vendor flow. The provided modules are highly parametrizable to fit application needs and to enable design space exploration. An extensive set of utility functions and frequently used data types benefits the conciseness of both library and user code. Finally, PoC enables the continuous verification of its IP cores through automated testbenches. This verification flow is only one part of a flow infrastructure that also supports the generation of re-usable netlists so as to speed up the integration of more complex cores into an application design. The flow infrastructure is implemented in Python and supports various simulation backends, synthesis tool chains and operating systems.
Citations: 4
Automatic framework to generate reconfigurable accelerators for option pricing applications
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857157
Pham Nam Khanh, Khin Mi Mi Aung, Akash Kumar
Option pricing is a fundamental application in most financial institutions dealing with the derivatives market. It frequently demands huge computational effort at low latency. Therefore, a number of different option pricing implementations have been developed on FPGA-based platforms. However, none of the existing works covers more than one model or different types of options, which makes it difficult to productively implement several hardware accelerators for different models. To fill this gap, we propose a design flow for generating efficient hardware accelerators for option pricing applications with different models and option types. The framework boosts designer productivity and enables quick prototyping on FPGA platforms by providing a general template architecture for option pricing applications. The architecture comes with a prebuilt design library, which covers a wide range of popular financial models. Experimental results for four models show that the accelerators generated by our design flow outperform their software counterparts with a speedup of two orders of magnitude. Compared with existing hardware designs for the same models, our framework can produce accelerators that outperform most manually designed engines.
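The abstract does not name the four models it evaluates. Purely as an illustration of the kind of computation such accelerators target, the sketch below prices a European call with the closed-form Black-Scholes formula; the choice of model and all parameter values are assumptions made for this example.

```python
import math


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot, strike, rate, sigma, maturity):
    """Price of a European call under the Black-Scholes model."""
    vol_sqrt_t = sigma * math.sqrt(maturity)
    d1 = (math.log(spot / strike)
          + (rate + 0.5 * sigma ** 2) * maturity) / vol_sqrt_t
    d2 = d1 - vol_sqrt_t
    discount = math.exp(-rate * maturity)
    return spot * norm_cdf(d1) - strike * discount * norm_cdf(d2)


# Illustrative parameters: at-the-money call, one year to maturity.
price = black_scholes_call(spot=100.0, strike=100.0, rate=0.05,
                           sigma=0.2, maturity=1.0)
print(f"call price: {price:.4f}")
```

Closed-form models like this one are cheap to evaluate; the heavy computational load the abstract mentions typically comes from models that need lattice or Monte Carlo evaluation, which is where a shared accelerator template would pay off.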
Citations: 2
Survey on and re-evaluation of wide adder architectures on FPGAs
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857189
Thomas B. Preußer, Markus Krause
Binary word addition was one of the earliest operations that called for special-purpose hardware structures on otherwise freely programmable logic devices. The large logic depth induced by the great fan-in that comprises both operands of the addition is especially harmful in SRAM-programmed FPGAs, where the delays of the configurable inter-LUT routing are expensive in comparison to the delays of the connected logic stages. These costs have been addressed by carry chains that establish direct links through a linear series of configurable logic blocks on virtually all modern FPGA devices. This structure is so effective that it puts a simple linear adder layout at a great advantage. Although it must eventually recede behind more sophisticated hierarchical adder structures with logarithmic delays, the actual turning point has been pushed beyond operand widths of 50 bits or more. Thus, many designs can simply rely on the default addition that is so well supported directly by the hardware. This changes in the context of extraordinarily wide operands, as they are often found in cryptographic applications. They require designers to identify an appropriate wide adder implementation that is able to meet their design goals. The typical bottleneck imposed by the wide fan-in of addition is the achievable clock rate. Various authors have analyzed the performance of the classic fast addition schemes and proposed adder architectures that genuinely blend classic hierarchical approaches with the capabilities of the fast carry chains. This paper presents a survey across these proposals and re-evaluates them in the context of modern FPGA devices.
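The abstract does not list the surveyed architectures. As one example of a classic hierarchical scheme, the following behavioural sketch models a carry-select adder: each block precomputes its sum for both possible carry-in values, so only the selection depends on the incoming carry. The function names and the block size are illustrative assumptions, and each block's inner addition stands in for a short carry chain.

```python
def block_add(a, b, carry_in, width):
    """Add two width-bit blocks; return (sum mod 2**width, carry_out)."""
    total = a + b + carry_in
    return total & ((1 << width) - 1), total >> width


def carry_select_add(a, b, width, block):
    """Behavioural model of a carry-select adder built from equal blocks."""
    if width % block != 0:
        raise ValueError("width must be a multiple of block")
    mask = (1 << block) - 1
    result, carry = 0, 0
    for shift in range(0, width, block):
        blk_a = (a >> shift) & mask
        blk_b = (b >> shift) & mask
        # In hardware, both candidates are computed in parallel for all blocks.
        sum0, carry0 = block_add(blk_a, blk_b, 0, block)
        sum1, carry1 = block_add(blk_a, blk_b, 1, block)
        # Only this selection lies on the carry-dependent critical path.
        blk_sum, carry = (sum1, carry1) if carry else (sum0, carry0)
        result |= blk_sum << shift
    return result, carry


WIDTH = 256
a = (1 << WIDTH) - 1
b = 1
print(carry_select_add(a, b, WIDTH, block=32))
```

The model makes the trade-off explicit: the duplicated block adders cost area, while the carry-dependent path shrinks to one multiplexer per block.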
Citations: 2
Detection and Isolation of permanent faults in FPGAs with remote access
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857165
F. Rittner, R. Glein, A. Heuberger
In this paper, we present a Fault Detection (FD) and Fault Isolation (FI) concept for permanent faults in SRAM-based FPGAs. Harsh environments, such as space, cause permanent faults, which result in defective hardware. Furthermore, physically inaccessible systems in such environments limit debugging capabilities. The proposed concept uses wireless remote debugging to access the remote device. The presumption of a permanent fault starts an extended hardware debugging procedure that performs off-line tests without the user application. First, the FD process confirms that the fault is permanent and rejects temporary faults (known as static and transient). Afterwards, a fine-grained permanent FD and FI step determines the affected primitives in the FPGA and excludes them from the FPGA design. This is realized with a customized recovery strategy for each primitive type, by blocking and bypassing the defective primitives. The focus of this paper is the feasibility of the concept.
Citations: 0
REoN: A protocol for reliable software-defined FPGA partial reconfiguration over network
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857184
Vaibhawa Mishra, Qianqiao Chen, G. Zervas
This paper presents and defines a Reconfiguration over Network (REoN) protocol. It is a solution for FPGA-based dynamically reconfigurable systems that offers partial (re)programming over the network without the need for a local or embedded soft or hard processor. The protocol can transport partial bit files from a centralized control and management system, via a network resource management API, to an FPGA-empowered network node using standard 10 Gbps Ethernet. This work architects and introduces a proprietary, lightweight, connection-oriented protocol stack that guarantees reliability over the standard UDP/IP protocol. The hardware stack for standard networking protocols includes a remote reconfiguration engine directly interfaced with the Xilinx Internal Configuration Access Port (ICAP). This minimizes the FPGA resources required to re-program the FPGA. The presented work is an enabling technology for a range of applications such as reconfigurable-computing-enabled Network Function Virtualization (NFV), function disaggregation in data centres empowered by FPGAs/SoCs, and the Internet of Things (IoT).
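The abstract does not describe how REoN achieves reliability on top of UDP, so the following is only a generic sketch of one common approach: number each chunk of the partial bit file, protect it with a checksum, and let the receiver report which chunks still need retransmission. The header layout, the names `make_packets` and `Receiver`, and the chunk size are all assumptions made for illustration and are not the REoN wire format.

```python
import struct
import zlib

# Hypothetical header: 32-bit sequence number + 32-bit CRC32 of the chunk.
HEADER = struct.Struct("!II")


def make_packets(bitstream: bytes, chunk_size: int = 1024) -> list:
    """Split a partial bit file into numbered, checksummed datagram payloads."""
    packets = []
    for seq, start in enumerate(range(0, len(bitstream), chunk_size)):
        chunk = bitstream[start:start + chunk_size]
        packets.append(HEADER.pack(seq, zlib.crc32(chunk)) + chunk)
    return packets


class Receiver:
    """Collects chunks and tracks which sequence numbers are still missing."""

    def __init__(self, total_packets: int):
        self.total = total_packets
        self.chunks = {}

    def receive(self, packet: bytes):
        """Return the sequence number to acknowledge, or None if rejected."""
        if len(packet) < HEADER.size:
            return None
        seq, crc = HEADER.unpack_from(packet)
        chunk = packet[HEADER.size:]
        if seq >= self.total or zlib.crc32(chunk) != crc:
            return None  # corrupted or out of range: no ACK, sender retransmits
        self.chunks[seq] = chunk
        return seq

    def missing(self) -> list:
        """Sequence numbers the sender still has to (re)transmit."""
        return [s for s in range(self.total) if s not in self.chunks]

    def assemble(self) -> bytes:
        """Rebuild the bit file once every chunk has arrived intact."""
        if self.missing():
            raise ValueError("transfer incomplete")
        return b"".join(self.chunks[s] for s in range(self.total))
```

In this sketch a sender would keep resending whatever `missing()` reports until the list is empty. Only then would the assembled data be handed to a configuration port. A hardware implementation would more likely stream chunks as they arrive in order.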
Citations: 11
FPGA-based encrypted network traffic identification at 100 Gbit/s
Pub Date : 2016-11-01 DOI: 10.1109/ReConFig.2016.7857172
Mario Ruiz, G. Sutter, S. López-Buedo, J. D. Vergara
Network traffic monitoring is becoming increasingly hard to manage due to the ever-growing speed of network links. At 100 Gbit/s, the huge volume of data makes it very difficult to perform online analyses or to store traffic for subsequent forensic investigations. It is therefore mandatory to carry out some kind of filtering and/or capping of the network traffic to be analyzed. Additionally, the fraction of encrypted traffic is relentlessly increasing. For such encrypted traffic, storing the payload is usually useless. In this paper we present an FPGA implementation of a method to identify plain text (that is, human-readable content) in the network packet payload. The method is based on both detecting bursts of printable ASCII characters and calculating the fraction of these printable characters in the packet payload. This method has proven to be very effective in reducing the amount of information used in traffic analysis, by saving only the headers of packets with encrypted payloads. We leveraged the advantages of high-level languages to reduce development time, though traditional HDLs were also used to optimize critical areas of the design. The design targets the 100 Gbit/s Ethernet interfaces of Xilinx Virtex UltraScale devices and is able to detect human-readable packet payloads at line rate with high accuracy.
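The two measurements named in the abstract, bursts of printable ASCII characters and the fraction of printable characters in the payload, can be illustrated in software. The sketch below is not the authors' FPGA design. The set of bytes treated as printable, the thresholds `min_fraction` and `min_burst`, and the rule combining the two measurements are assumptions chosen only for illustration.

```python
from typing import Tuple

# Assumption: printable means ASCII 0x20..0x7E plus tab, line feed, carriage return.
PRINTABLE = frozenset(range(0x20, 0x7F)) | {0x09, 0x0A, 0x0D}


def printable_stats(payload: bytes) -> Tuple[float, int]:
    """Return (fraction of printable bytes, longest run of printable bytes)."""
    if not payload:
        return 0.0, 0
    printable = 0
    longest = 0
    run = 0
    for byte in payload:
        if byte in PRINTABLE:
            printable += 1
            run += 1
            longest = max(longest, run)
        else:
            run = 0  # a non-printable byte ends the current burst
    return printable / len(payload), longest


def looks_like_plain_text(payload: bytes,
                          min_fraction: float = 0.9,
                          min_burst: int = 16) -> bool:
    """Hypothetical decision rule: either measurement exceeding its threshold."""
    fraction, burst = printable_stats(payload)
    return fraction >= min_fraction or burst >= min_burst
```

For example, `looks_like_plain_text(b"GET /index.html HTTP/1.1\r\n")` returns `True`, while a payload consisting only of bytes outside the printable set returns `False`. A monitor built on this idea would store the full packet in the first case and only the headers in the second.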
Citations: 3