This poster presents our preliminary findings on the relationship between speedup and energy efficiency in FPGA-based Chip Heterogeneous Multiprocessor Systems (CHMPs). While researchers have investigated how to tailor combinations of heterogeneous compute engines within a CHMP system to best meet the performance needs of specific applications, how these optimized architectures also affect energy efficiency is not as well studied. We show that a simple relationship exists between the speedup these systems gain and their associated energy efficiency, linking Amdahl's law to energy efficiency. All experimental results were obtained through actual run-time measurements on homogeneous and heterogeneous multiprocessor systems implemented within a Xilinx Virtex-6 FPGA. We further show how, in a system with six MicroBlaze soft processors, the dynamic power, and hence the overall energy efficiency of the system, can be controlled through transparent operating system management of the compute resources. We also present how clock gating can control the dynamic power consumption of each processor; with this careful power-aware management unit, the system's dynamic power consumption can track the requirements of each application.
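The speedup/efficiency relationship the poster alludes to can be illustrated with a minimal sketch. The power model below is our own illustrative assumption (a fixed static term plus a per-active-core dynamic term), not the paper's measured Virtex-6 data:

```python
# Illustrative sketch: Amdahl's-law speedup and a simple
# performance-per-watt estimate for an n-processor system.
# The power model (p_static, p_dynamic) is a hypothetical assumption.

def amdahl_speedup(parallel_fraction, n):
    """Speedup when a fraction of the workload can use n processors."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

def energy_efficiency(parallel_fraction, n, p_static=1.0, p_dynamic=1.0):
    """Performance per watt: each active core adds p_dynamic,
    and the whole chip pays p_static regardless."""
    speedup = amdahl_speedup(parallel_fraction, n)
    power = p_static + n * p_dynamic
    return speedup / power
```

With a 90% parallel workload on six cores this gives a speedup of 4x, but efficiency of only about 0.57 of a single unit-power core, which is why gating idle cores matters.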
Sen Ma, D. Andrews, "On energy efficiency and Amdahl's law in FPGA based chip heterogeneous multiprocessor systems (abstract only)," Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2014. doi:10.1145/2554688.2554719
Processors with an embedded runtime-reconfigurable fabric have been explored in academia, and industry has started production of commercial platforms (e.g., Xilinx Zynq-7000). While such fabrics provide significant performance and efficiency gains, the comparatively long reconfiguration time limits these advantages when applications request reconfigurations frequently. In multi-tasking systems, frequent task switches lead to frequent reconfigurations and are thus a major hurdle for further performance increases. Sophisticated task scheduling is a very effective means of reducing the negative impact of these reconfiguration requests. In this paper, we propose an online approach that combines task scheduling with re-distribution of the reconfigurable fabric between tasks in order to reduce the makespan, i.e., the completion time of a taskset executing on a runtime-reconfigurable processor. Evaluating multiple tasksets composed of multimedia applications, our proposed approach achieves makespans that are on average only 2.8% worse than those of a theoretically optimal schedule that assumes zero-overhead reconfiguration. In comparison, scheduling approaches deployed in state-of-the-art reconfigurable processors achieve makespans 14%-20% worse than optimal. As our approach is a purely software-side mechanism, a multitude of reconfigurable platforms aimed at multi-tasking can benefit from it.
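Why reconfiguration overhead inflates makespan can be seen with a toy sequential scheduler. This is a hypothetical model of our own, not MORP's actual algorithm: each task carries a configuration id, and switching configurations pays a fixed delay:

```python
# Toy model (not MORP): sequential execution where changing the
# accelerator configuration between tasks costs reconfig_time.

def makespan(tasks, reconfig_time):
    """tasks: list of (config_id, run_time) tuples, executed in order."""
    total = 0.0
    current = None
    for config, run_time in tasks:
        if config != current:
            total += reconfig_time  # pay the reconfiguration penalty
            current = config
        total += run_time
    return total
```

Even in this crude model, ordering tasks so that equal configurations are adjacent removes half the reconfigurations and shortens the makespan, which is the kind of gain a reconfiguration-aware scheduler exploits.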
Artjom Grudnitsky, L. Bauer, J. Henkel, "MORP: makespan optimization for processors with an embedded reconfigurable fabric," Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2014. doi:10.1145/2554688.2554782
Yuliang Sun, Zilong Wang, Sitao Huang, Lanjun Wang, Yu Wang, Rong Luo, Huazhong Yang
Frequent item counting is one of the most important operations in time-series data mining algorithms, and the space-saving algorithm is a widely used approach to this problem. As data input speeds rise rapidly, the most challenging problem in frequent item counting is meeting the requirement of wire-speed processing. In this paper, we propose a streaming-oriented PE-ring framework on FPGA for counting frequent items. Compared with the best existing FPGA implementation, our basic PE-ring framework reduces lookup-table resource cost by 50% and achieves the same throughput in a more scalable way. Furthermore, we adopt a SIMD-like cascaded filter for further performance improvement, which outperforms the previous work by up to 3.24 times for some data distributions.
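The space-saving algorithm referenced above is simple to state in software. A minimal sketch of the standard counter-based scheme (the software baseline, not the PE-ring hardware itself):

```python
# Minimal Space-Saving sketch: track roughly the k most frequent
# items of a stream with at most k counters.

def space_saving(stream, k):
    counters = {}  # item -> estimated count
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Evict the item with the minimum count; the newcomer
            # inherits that count plus one (an overestimate bound).
            victim = min(counters, key=counters.get)
            count = counters.pop(victim)
            counters[item] = count + 1
    return counters
```

The eviction step is the part that is awkward at wire speed, since it requires finding the minimum counter per incoming item, which is what dedicated hardware structures are designed to hide.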
Yuliang Sun, Zilong Wang, Sitao Huang, Lanjun Wang, Yu Wang, Rong Luo, Huazhong Yang, "Accelerating frequent item counting with FPGA," Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2014. doi:10.1145/2554688.2554766
Polynomial evaluation is important across a wide range of application domains, so significant work has been done on accelerating its computation. The conventional algorithm, known as Horner's rule, involves the fewest steps but can suffer from long latency due to its serial computation. Parallel evaluation algorithms such as Estrin's method have shorter latency than Horner's rule, but at the expense of large hardware overhead. This paper presents an efficient polynomial evaluation algorithm that restructures the evaluation process to include an increased number of squaring steps. Using a squarer design that is more efficient than a general multiplier, this yields polynomial evaluation with a 57.9% latency reduction over Horner's rule and 14.6% over Estrin's method, while consuming less area than Horner's rule, when implemented on a Xilinx Virtex-6 FPGA. When applied to fixed-point function evaluation, where precision requirements limit the rounding of operands, it still achieves a 52.4% performance gain compared to Horner's rule with only a 4% area overhead when evaluating 5th-degree polynomials.
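The two baselines the paper compares against can be contrasted in a few lines. This is a generic software sketch of Horner's rule and Estrin's method, not the paper's square-rich scheme (coefficients are given lowest degree first):

```python
# Horner vs. Estrin for p(x) = c0 + c1*x + c2*x^2 + ...

def horner(coeffs, x):
    """Serial evaluation: every multiply-add depends on the previous one."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def estrin(coeffs, x):
    """Pairwise evaluation, e.g. (c0 + c1*x) + x^2*(c2 + c3*x) for
    degree 3: the pairs are independent, exposing parallelism at the
    cost of extra multipliers (and repeated squarings of x)."""
    if len(coeffs) == 1:
        return coeffs[0]
    pairs = [coeffs[i] + coeffs[i + 1] * x if i + 1 < len(coeffs)
             else coeffs[i]
             for i in range(0, len(coeffs), 2)]
    return estrin(pairs, x * x)
```

Note that Estrin's recursion repeatedly squares x, which hints at why a cheap dedicated squarer, as exploited in the paper, shifts the cost balance.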
Simin Xu, Suhaib A. Fahmy, I. Mcloughlin, "Square-rich fixed point polynomial evaluation on FPGAs," Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2014. doi:10.1145/2554688.2554779
MapReduce is a widely used programming framework for implementing cloud computing applications in data centers. This work presents a novel configurable hardware accelerator that speeds up multi-core and cloud computing applications based on the MapReduce programming framework. The proposed accelerator augments multi-core processors and performs fast indexing and accumulation of key/value pairs using an efficient memory architecture based on Cuckoo hashing. The accelerator consists of memory buffers that store the key/value pairs and processing units that accumulate the values of keys sent from the processors. In essence, the accelerator relieves the processors of the Reduce tasks: they execute only the Map tasks and emit the intermediate key/value pairs to the hardware acceleration unit, which performs the Reduce operation. The number and size of the keys that can be stored on the accelerator are configurable based on application requirements. The accelerator has been implemented on a multi-core FPGA with embedded ARM processors (Xilinx Zynq) and integrated with the MapReduce programming framework under Linux. The performance evaluation shows that the proposed accelerator can achieve up to 1.8x system speedup for MapReduce applications and hence significantly reduce the execution time of multi-core and cloud computing applications. (Action: "Supporting Postdoctoral Researchers", "Education and Lifelong Learning" Program (GSRT), co-financed by the ESF and the Greek State.)
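Cuckoo hashing, the indexing scheme the accelerator's memory architecture builds on, guarantees that a key resides in one of two candidate slots, giving constant-time lookups. A minimal software sketch with illustrative integer hash functions (our own placeholders, not the hardware's):

```python
# Minimal cuckoo hash sketch: two tables, two hash functions; an insert
# that finds both slots occupied kicks out an occupant and relocates it.

class CuckooTable:
    def __init__(self, size=8, max_kicks=16):
        self.size = size
        self.max_kicks = max_kicks
        self.tables = [[None] * size, [None] * size]

    def _slot(self, which, key):
        # Two simple multiplicative hashes (illustrative; integer keys).
        consts = (2654435761, 2246822519)
        return ((key * consts[which]) >> 8) % self.size

    def insert(self, key, value):
        entry = (key, value)
        for _ in range(self.max_kicks):
            for which in (0, 1):
                i = self._slot(which, entry[0])
                if self.tables[which][i] is None:
                    self.tables[which][i] = entry
                    return True
                # Evict the occupant and try to re-place it instead.
                self.tables[which][i], entry = entry, self.tables[which][i]
        return False  # give up: table would need rehashing/growth

    def get(self, key):
        """Lookup touches at most two slots, one per table."""
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None
```

The bounded two-probe lookup is what makes the scheme attractive in hardware: both slots can be read in parallel from separate memory banks.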
C. Kachris, G. Sirakoulis, D. Soudris, "A configurable mapreduce accelerator for multi-core FPGAs (abstract only)," Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2014. doi:10.1145/2554688.2554700
Jian Gong, Jiahua Chen, Haoyang Wu, Fan Ye, Songwu Lu, J. Cong, Tao Wang
The rapid growth in the resources and processing power of FPGAs has made them more and more attractive as accelerator platforms. Due to its high performance, the PCIe bus is the preferred interconnect between the host computer and loosely coupled FPGA accelerators. To fully utilize the performance of PCIe, developers normally have to write a significant amount of PCIe-related code. In this paper, we present the design of EPEE, an efficient PCIe communication library that integrates with hosts easily and relieves developers of this burden. Making a PCIe communication library both highly efficient and easy to integrate is not trivial. We have identified several challenges in this work: 1) the conflict between efficiency and functionality; 2) support for multi-clock-domain interfaces; 3) handling out-of-order DMA data transfers; 4) portability. Few existing systems address all of these challenges. EPEE has a highly efficient core library that is extensible. We provide a set of high-level APIs to shorten developers' learning curve, and divide the hardware library into device-dependent and device-independent layers for portability. We have implemented EPEE on several generations of Xilinx FPGAs, reaching 12.7 Gbps half-duplex and 20.8 Gbps full-duplex data rates in PCIe Gen2 x4 mode (79.4% and 64.0% of the theoretical maximum data rates, respectively). EPEE has already been used in four different FPGA applications, and it can be integrated with high-level synthesis tools, in particular Vivado HLS.
Jian Gong, Jiahua Chen, Haoyang Wu, Fan Ye, Songwu Lu, J. Cong, Tao Wang, "EPEE: an efficient PCIe communication library with easy-host-integration property for FPGA accelerators (abstract only)," Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2014. doi:10.1145/2554688.2554723
Although the reliability and robustness of the AES protocol have been thoroughly demonstrated over the years, recent research results and technology advancements are raising serious concerns about its solidity in the (quite near) future. In fact, smarter brute-force attacks and new computing systems are expected to drastically decrease the security of the AES protocol in the coming years (e.g., quantum computing will enable search algorithms able to perform a brute-force attack on a 2n-bit key in the same time required by a conventional algorithm for an n-bit key). In this context, we propose an extension of the AES algorithm that supports longer encryption keys, thus increasing the security of the algorithm itself. In addition, we propose a set of parametric implementations of this novel extended protocol. These architectures can be optimized either to minimize area usage or to maximize performance. Experimental results show that, while the proposed implementations achieve higher throughput than most state-of-the-art approaches and the highest Performance/Area value when working with 128-bit encryption keys, they achieve an 84x throughput speedup compared to approaches in the literature when working with 512-bit encryption keys.
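The quantum-computing claim in the abstract follows from Grover's algorithm, which searches a space of N keys in on the order of sqrt(N) oracle queries. A quick sanity check of that key-strength arithmetic:

```python
import math

# Grover's algorithm needs ~sqrt(N) queries to search N keys, so a
# 2n-bit key under quantum search costs about as many steps as an
# n-bit key under classical exhaustive search.

def classical_steps(key_bits):
    return 2 ** key_bits  # worst-case exhaustive search

def grover_steps(key_bits):
    return math.isqrt(2 ** key_bits)  # ~sqrt(2^bits) oracle queries
```

This halving of effective key strength is why moving from 128-bit to 256-bit (or longer) keys restores the classical security margin.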
A. A. Nacci, V. Rana, M. Santambrogio, D. Sciuto, "Improving the security and the scalability of the AES algorithm (abstract only)," Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2014. doi:10.1145/2554688.2554735
This paper deals with the design of an application-specific processor that uses high-level-synthesized instruction engines. The approach is demonstrated on a high-speed network flow measurement processor for FPGA. Our newly proposed concept, called Software Defined Monitoring (SDM), relies on advanced monitoring tasks implemented in software and supported by a configurable hardware accelerator. The monitoring tasks reside in software and can easily control the level of detail retained by the hardware for each flow. This way, the measurement of bulk/uninteresting traffic is offloaded to the hardware, while the interesting traffic is processed in software. SDM enables the creation of flexible monitoring systems capable of deep packet inspection at high throughput. We introduce the processor architecture and a workflow that allows hardware-accelerated measurement modules (instructions) to be created from descriptions in C/C++. The processor offloads various aggregations and statistics from the main system CPU. The basic type of offload is NetFlow statistics aggregation. We create and evaluate three more aggregation instructions to demonstrate the flexibility of our system. Compared to hand-written instructions, the high-level-synthesized instructions are slightly worse in terms of both FPGA resource consumption and frequency. However, the development time is approximately halved.
V. Pus, Pavel Benácek, "Application specific processor with high level synthesized instructions (abstract only)," Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2014. doi:10.1145/2554688.2554754
The continuous scaling of the fabrication process, combined with the ever-increasing need for high-performance designs, means that the era of treating all devices the same is coming to an end. The presented work considers device-oriented optimisations to further boost the performance of a Linear Projection design, focusing on the over-clocking of arithmetic operators. A methodology is proposed for accelerating Linear Projection designs on an FPGA that exposes information about the performance of the hardware under over-clocking conditions to the application level. The novelty of this method is the pre-characterisation of the arithmetic operators most prone to error and the use of this information in the high-level optimisation of the design. This results in a set of circuit designs that achieve higher throughput with minimal error. FPGA devices are suitable for such optimisations because their reconfigurability allows the underlying fabric to be characterised before the final system is designed. The reported results show that significant performance gains, up to 1.85 times higher throughput compared to existing methodologies, can be achieved when such device-specific optimisation is considered.
R. Duarte, C. Bouganis, "Pushing the performance boundary of linear projection designs through device specific optimisations (abstract only)," Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2014. doi:10.1145/2554688.2554717
{"title":"FPGA LUT design for wide-band dynamic voltage and frequency scaled operation (abstract only)","authors":"M. Abusultan, S. Khatri","doi":"10.1145/2554688.2554708","DOIUrl":"https://doi.org/10.1145/2554688.2554708","url":null,"abstract":"Field programmable gate arrays (FPGAs) are the implementation platform of choice when it comes to design flexibility. However, the high power consumption of FPGAs, which arises from their flexible structure, makes them less appealing for extreme low-power applications. In this paper, we present the design of an FPGA look-up table (LUT) with the goal of seamless operation over a wide band of supply voltages. The same LUT design can operate at a sub-threshold voltage when low power is required and at higher voltages whenever faster performance is required. The results show that operating the LUT in sub-threshold mode yields ~80x lower power and ~4x lower energy than full supply voltage operation, for a 6-input LUT implemented in a 22nm predictive technology. The key drawback of sub-threshold operation is its susceptibility to process, temperature, and supply voltage (PVT) variations. This paper also presents the design and experimental results of a closed-loop adaptive body biasing mechanism that dynamically cancels these PVT variations. For the same 22nm technology, we demonstrate that the closed-loop adaptive body biasing circuits allow the FPGA to operate over a frequency range that spans an order of magnitude (40 MHz to 1300 MHz). We also show that these circuits can cancel delay variations due to supply voltage changes and reduce the effect of process variations on setup and hold times by 1.8x and 2.9x, respectively.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"138 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116403757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
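The closed-loop adaptive body biasing described above is an analog, on-chip feedback mechanism; as a rough illustration of the feedback idea only, the sketch below simulates a controller that integrates replica-path delay error into a body-bias setting until PVT-induced drift is nulled. The delay model, gains, and step sizes are invented for illustration and are not taken from the paper:

```python
# Toy feedback-loop sketch (assumption: a linearised replica-path delay
# model where PVT drift slows the path and forward body bias speeds it up).

def measured_delay(bias, nominal=1.0, drift=0.2, gain=0.5):
    """Replica-path delay (arbitrary units): nominal delay plus PVT
    drift, reduced in proportion to the applied body bias."""
    return nominal + drift - gain * bias

def regulate(target=1.0, step=0.05, tol=1e-3, max_iters=1000):
    """Simple integral controller: accumulate delay error into the
    body-bias setting until the replica path matches the target."""
    bias = 0.0
    for _ in range(max_iters):
        err = measured_delay(bias) - target
        if abs(err) < tol:
            break
        bias += step * err  # push bias in the direction that cancels err
    return bias

bias = regulate()
print(round(bias, 2))  # settles near drift/gain = 0.4 in this toy model
```

In this linear toy model the loop converges to the bias that exactly cancels the drift term; the real circuit closes the same kind of loop continuously in hardware, which is what lets it track temperature and supply-voltage changes at run time.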