
2009 International Conference on Reconfigurable Computing and FPGAs: Latest Publications

Matrix Multiplication Based on Scalable Macro-Pipelined FPGA Accelerator Architecture
Pub Date : 2009-12-09 DOI: 10.1109/ReConFig.2009.30
Jiang Jiang, Vincent Mirian, Kam Pui Tang, P. Chow, Zuocheng Xing
In this paper, we introduce a scalable macro-pipelined architecture (SMPA) for floating-point matrix multiplication, which aims to exploit temporal parallelism and architectural scalability. We demonstrate the functionality of the hardware design with 16 processing elements (PEs) on a Xilinx ML507 development board containing a Virtex-5 XC5VFX70T. A 32-PE design for matrix sizes ranging from 32x32 to 1024x1024 is also simulated. Our experiments show 12.18 GFLOPS with 32 PEs, or about 1.90 GFLOPS per PE per GHz, corresponding to over 95% PE utilization. Moreover, the proposed SMPA can scale up to tens or hundreds of GFLOPS using multiple FPGA devices and high-speed interconnect.
Citations: 20
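The temporal pipelining the SMPA exploits can be modelled in a few lines: a partial product streams through a chain of PEs, each adding the rank-1 contributions of its assigned k-indices. The following is a behavioural sketch in Python, not the paper's RTL; the PE count and work split are illustrative assumptions.

```python
# Software model of a macro-pipelined matrix multiplier: each processing
# element (PE) in a chain adds the rank-1 contribution of its assigned
# k-indices to the partial result that flows through the pipeline.

def matmul_pipeline(A, B, num_pes=4):
    n = len(A)
    # The partial result entering the pipeline is the zero matrix.
    C = [[0.0] * n for _ in range(n)]
    for pe in range(num_pes):
        # PE `pe` owns the k-indices congruent to its position in the chain.
        for k in range(pe, n, num_pes):
            for i in range(n):
                aik = A[i][k]
                for j in range(n):
                    C[i][j] += aik * B[k][j]
        # In hardware, C would now be forwarded to the next PE; in this
        # software model the loop iteration plays that role.
    return C
```

Each PE touches every element of C once per owned k, so the work is balanced across the chain and the result is identical to an ordinary triple-loop multiply.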
A Scalable Architecture for Multivariate Polynomial Evaluation on FPGA
Pub Date : 2009-12-09 DOI: 10.1109/ReConFig.2009.22
Mathieu Allard, P. Grogan, J. David
Polynomial evaluation is used in multiple domains such as image processing, control systems, and applied mathematics. Its high demand in computation time and the need for embedded solutions make it a good target for a hardware-oriented solution. This paper presents a new scalable architecture and its FPGA implementation, designed to exploit the high level of parallelism present in such applications. Illustrated by an example from the field of 3-D graphics computation, results show acceleration factors varying from 178 to 880 for orders ranging from 4 to 19, while the associated hardware cost scales linearly with polynomial order. Moreover, using parallel instances of the architecture to evaluate multiple polynomials, an acceleration factor as high as 30,858 can be obtained compared to execution on a single processor.
Citations: 2
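The parallelism such an architecture exploits is visible in a plain software model: every term of a multivariate polynomial can be evaluated independently before a final summation. A minimal sketch, with the term representation assumed for illustration:

```python
# Evaluate a multivariate polynomial given as a list of (coefficient,
# exponents) terms. Each term is independent, which is the parallelism a
# hardware pipeline can exploit by evaluating many terms concurrently.

def eval_poly(terms, point):
    total = 0.0
    for coeff, exps in terms:          # terms could map to parallel PEs
        t = coeff
        for x, e in zip(point, exps):  # product x1^e1 * x2^e2 * ...
            t *= x ** e
        total += t                     # adder tree in hardware
    return total

# p(x, y) = 3*x^2*y + 2*y^3 - 1
terms = [(3.0, (2, 1)), (2.0, (0, 3)), (-1.0, (0, 0))]
```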
Runtime Memory Allocation in a Heterogeneous Reconfigurable Platform
Pub Date : 2009-12-09 DOI: 10.1109/ReConFig.2009.38
V. Sima, K. Bertels
In this paper, we present a runtime memory allocation algorithm that aims to substantially reduce the overhead caused by shared-memory accesses by allocating memory directly in local scratch-pad memories. We target a heterogeneous platform with a complex memory hierarchy. Using special instrumentation, we determine which memory areas are used in functions that could run on different processing elements, such as a reconfigurable logic array. Based on profile information, the programmer annotates some functions as candidates for accelerated execution. An algorithm then decides the best allocation, taking into account the various processing elements and the scratch-pad memories of the heterogeneous platform. Tests are performed on our prototype platform, a Virtex ML410 running Linux, containing a PowerPC processor and a Xilinx FPGA and implementing the MOLEN programming paradigm. We test the algorithm using a state-of-the-art H.264 video encoder as well as synthetic applications. The performance improvement for the H.264 application is 14% compared to the software-only version, while the overhead is less than 1% of the application execution time. This is the optimal improvement obtainable by optimizing the memory allocation. For the synthetic applications, the results are within 5% of the optimum.
Citations: 6
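The core allocation decision can be sketched as a greedy capacity-constrained choice: place the memory areas whose migration to scratch-pad saves the most cycles per byte. This is an illustrative simplification, not the paper's algorithm, and the latencies are assumed values:

```python
# Given profiled access counts for memory areas used by accelerated
# functions, choose which areas to place in a limited scratch-pad so that
# the cycles saved by avoiding shared-memory accesses are maximized.
# Greedy by saved-cycles-per-byte (a sketch; assumed latencies).

SHARED_LAT, SPM_LAT = 20, 2            # assumed access latencies in cycles

def allocate_scratchpad(areas, capacity):
    """areas: list of (name, size_bytes, access_count); returns chosen names."""
    ranked = sorted(areas,
                    key=lambda a: (SHARED_LAT - SPM_LAT) * a[2] / a[1],
                    reverse=True)
    chosen, used = [], 0
    for name, size, _accesses in ranked:
        if used + size <= capacity:    # area fits in remaining scratch-pad
            chosen.append(name)
            used += size
    return chosen
```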
A Framework for 2.5D NoC Exploration Using Homogeneous Networks over Heterogeneous Floorplans
Pub Date : 2009-12-09 DOI: 10.1109/ReConFig.2009.14
V. D. Paulo, Cristinel Ababei
In this paper, we propose a new 2.5D NoC architecture that places a homogeneous network layer on top of a heterogeneous floorplanning layer. The purpose of this approach is to exploit the benefits of both compact heterogeneous floorplans and regular mesh networks through an automated design-space exploration procedure. A design methodology consisting of floorplanning and router assignment is implemented in a purpose-built tool that integrates a cycle-accurate NoC simulator and is used to investigate the new architecture. We use this tool to compute flit latency and compare it to a conventional 2D implementation. Separating cores and network onto two different layers frees additional area, which we use to improve network performance by searching for the optimal buffer sizes, number of virtual channels, or mesh size. Experimental results are application-specific, with potentially significant performance improvements for some test cases.
Citations: 8
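The flit-latency figure such an explorer evaluates can be approximated, at zero load, from the XY-routing hop count. A toy model with assumed per-hop router and link delays, not the cycle-accurate simulator integrated in the tool:

```python
# Zero-load flit latency on a 2D mesh under dimension-ordered XY routing:
# the flit traverses |dx| + |dy| hops, each costing an assumed router
# pipeline delay plus a link delay.

def xy_hops(src, dst):
    (sx, sy), (dx, dy) = src, dst
    return abs(sx - dx) + abs(sy - dy)

def zero_load_latency(src, dst, router_delay=3, link_delay=1):
    # router_delay and link_delay are assumed cycle counts per hop
    return xy_hops(src, dst) * (router_delay + link_delay)
```

A design-space explorer would sweep mesh size and router parameters, re-evaluating this (or a contention-aware refinement of it) for the application's traffic pairs.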
Protecting the NOEKEON Cipher against SCARE Attacks in FPGAs by Using Dynamic Implementations
Pub Date : 2009-12-09 DOI: 10.1109/ReConFig.2009.19
J. Bringer, H. Chabanne, J. Danger
Protecting an implementation against Side-Channel Analysis for Reverse Engineering (SCARE) attacks is a great challenge, and we address it by presenting a first proof of concept. White-box cryptography has been developed to protect programs against an adversary who has full access to their software implementation. It has also been suggested as a countermeasure against side-channel attacks, and we examine these techniques here in the wider perspective of SCARE. We assume the adversary has access to the cryptographic device only through its side channels and aims to recover the specification of the algorithm. In this work, we focus on FPGA (Field-Programmable Gate Array) technologies and examine how to thwart SCARE attacks by implementing a block cipher following white-box techniques. The proposed principle is based on dynamically changing the implementations. It is illustrated with an example on the NOEKEON cipher, and its feasibility on different FPGAs is studied.
Citations: 3
A Traversal Cache Framework for FPGA Acceleration of Pointer Data Structures: A Case Study on Barnes-Hut N-body Simulation
Pub Date : 2009-12-09 DOI: 10.1109/ReConFig.2009.68
J. Coole, J. Wernsing, G. Stitt
Numerous studies have shown that field-programmable gate arrays (FPGAs) often achieve large speedups compared to microprocessors. However, one significant limitation of FPGAs that has prevented their use on important applications is the requirement for regular memory access patterns. Traversal caches were previously introduced to improve the performance of FPGA implementations of algorithms with irregular memory access patterns, especially those traversing pointer-based data structures. However, a significant limitation of previous traversal caches is that speedup was limited to traversals repeated frequently over time, thus preventing speedup for algorithms without repetition, even if the similarity between traversals was large. This paper presents a new framework that extends traversal caches to enable performance improvements in such cases and provides additional improvements through reduced memory accesses and parallel processing of multiple traversals. Most importantly, we show that, for algorithms with highly similar traversals, the traversal cache framework achieves approximately linear kernel speedup with additional area, thus eliminating the memory bandwidth bottleneck commonly associated with FPGAs. We evaluate the framework using a Barnes-Hut n-body simulation case study, showing application speedups ranging from 12x to 13.5x on a Virtex4 LX100 with projected speedups as high as 40x on today’s largest FPGAs.
Citations: 14
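The traversal-cache idea can be sketched in software: a first pointer-chasing pass flattens the traversal into a contiguous array, and repeated or similar traversals then stream it sequentially. An illustrative model, not the paper's framework:

```python
# Traversal cache sketch for a linked list: the first traversal performs
# irregular pointer chasing and records the visited values; subsequent
# traversals of the same structure read the flattened, cache-friendly copy.

_traversal_cache = {}

class Node:
    def __init__(self, val, nxt=None):
        self.val, self.nxt = val, nxt

def traverse(head):
    key = id(head)
    if key not in _traversal_cache:    # first pass: pointer chasing
        vals, node = [], head
        while node is not None:
            vals.append(node.val)
            node = node.nxt
        _traversal_cache[key] = vals
    return _traversal_cache[key]       # repeat passes: sequential stream
```

In the FPGA setting the flattened copy lives in fast on-chip memory, which is what removes the external-memory bandwidth bottleneck for repeated or highly similar traversals.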
Efficient Technique for the FPGA Implementation of the AES MixColumns Transformation
Pub Date : 2009-12-09 DOI: 10.1109/ReConFig.2009.52
S. Ghaznavi, C. Gebotys, R. Elbaz
The Advanced Encryption Standard (AES) is commonly used to provide security services such as data confidentiality and authentication in embedded systems. However, designing efficient hardware architectures with small resource usage and short critical-path delay is a challenge. In this paper, a new technique for the FPGA implementation of the MixColumns transformation, an important part of AES, is introduced. The proposed MixColumns architecture, targeting 4-input LUTs on an FPGA, uses up to 23% fewer hardware resources than previous research. Overall, incorporating the proposed technique along with block memories for the SubBytes transformation in AES encryption reduces hardware resource usage by up to 10% in slices and 18% in LUTs. The improvement is obtained by more efficient resource sharing through expansion and rearrangement of the MixColumns equations with respect to the structure of FPGAs. This can be highly advantageous in FPGA implementations of block cipher modes using AES in secure embedded systems.
Citations: 15
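For reference, the MixColumns transformation being optimized is the fixed GF(2^8) linear map of FIPS-197: each output byte combines the column's bytes with the coefficients {02, 03, 01, 01}. A software model of the transformation itself (the paper's contribution is the LUT-level FPGA mapping, not this arithmetic):

```python
# MixColumns per FIPS-197, operating on one 4-byte column.

def xtime(b):                     # multiply by x (i.e., by 2) in GF(2^8)
    b <<= 1
    return (b ^ 0x1B) & 0xFF if b & 0x100 else b

def mix_column(col):
    a0, a1, a2, a3 = col
    def mul3(b):                  # 3*b = (2*b) xor b
        return xtime(b) ^ b
    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]
```

Because every output bit is an XOR of input bits (and conditionally the 0x1B reduction bits), the whole map is a candidate for exactly the kind of 4-input-LUT sharing the paper describes.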
A Reconfigurable Architecture for Stereo-Assisted Detection of Point-Features for Robot Mapping
Pub Date : 2009-12-09 DOI: 10.1109/RECONFIG.2009.41
J. Kalomiros, J. Lygouras
A hardware-friendly procedure is presented for extracting point features from stereo image pairs for real-time robot motion estimation and 3-D environment mapping. The procedure is implemented in reconfigurable hardware and developed as a set of custom HDL library components ready for integration in a system on a programmable chip. The main hardware stages are a stereo accelerator, corner detectors for the left and right images, and a stage performing a left-right consistency check. For the stereo-processor stage we have implemented and tested a SAD-based component for local area matching and a global-matching component based on a maximum-likelihood dynamic-programming technique. The system includes a Nios II processor for data control and a USB 2.0 interface for host communication. Resource usage and 3-D mapping results are reported for different versions of the reconfigurable system.
Citations: 4
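The SAD-based local-matching kernel reduces, per pixel, to minimizing a window sum of absolute differences over candidate disparities. A minimal software sketch with an assumed window radius, not the HDL component:

```python
# SAD (sum of absolute differences) disparity search for one pixel of a
# rectified stereo pair: shift the right-image window by each candidate
# disparity d and keep the d with the lowest mismatch.

def sad(left, right, x, y, d, w=1):
    s = 0
    for dy in range(-w, w + 1):
        for dx in range(-w, w + 1):
            s += abs(left[y + dy][x + dx] - right[y + dy][x + dx - d])
    return s

def best_disparity(left, right, x, y, max_d, w=1):
    # restrict d so the shifted window stays inside the right image
    candidates = range(min(max_d + 1, x - w + 1))
    return min(candidates, key=lambda d: sad(left, right, x, y, d, w))
```

In hardware the window sums for all candidate disparities are computed in parallel, which is what makes the accelerator real-time.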
FPGA Implementations of BCD Multipliers
Pub Date : 2009-12-09 DOI: 10.1109/ReConFig.2009.28
G. Sutter, E. Todorovich, G. Bioul, M. Vazquez, J. Deschamps
This paper presents a number of approaches to implementing decimal multiplication algorithms on Xilinx FPGAs. A variety of algorithms for basic one-digit-by-one-digit multiplication are proposed, and FPGA implementations are presented. N-by-one-digit and N-by-M-digit multiplications are then studied. Time and area results for sequential and combinational implementations improve on previously published work. Comparisons against fully optimized binary multipliers highlight the interest of the proposed design techniques.
Citations: 33
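The basic one-digit-by-one-digit building block, and its use in an N-by-one-digit multiply, can be sketched behaviourally. This is a software model of the arithmetic only, not the LUT-level mappings the paper compares:

```python
# One-by-one digit BCD multiply: the binary product of two decimal digits
# (0..81) is split into a tens digit and a units digit, each a valid BCD
# code. An N-digit-by-one-digit multiply then chains these with a carry.

def bcd_digit_mul(a, b):
    assert 0 <= a <= 9 and 0 <= b <= 9
    p = a * b                    # binary product, at most 81
    return p // 10, p % 10       # (tens, units) as two BCD digits

def bcd_mul_n_by_1(digits, d):
    """N-digit BCD number (least-significant digit first) times one digit."""
    out, carry = [], 0
    for a in digits:
        tens, units = bcd_digit_mul(a, d)
        s = units + carry
        out.append(s % 10)       # this position's BCD digit
        carry = tens + s // 10   # decimal carry into the next position
    while carry:
        out.append(carry % 10)
        carry //= 10
    return out
```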
Reconfigurable Hardware Implementation of Arithmetic Modulo Minimal Redundancy Cyclotomic Primes for ECC
Pub Date : 2009-12-09 DOI: 10.1109/ReConFig.2009.67
Brian Baldwin, W. Marnane, R. Granger
The dominant cost in Elliptic Curve Cryptography (ECC) over prime fields is modular multiplication. Minimal Redundancy Cyclotomic Primes (MRCPs) were recently introduced by Granger et al. for use as base-field moduli in ECC, since they permit a novel and very efficient modular multiplication algorithm. Here we consider a reconfigurable hardware implementation of arithmetic modulo a 258-bit example, for use at the 128-bit AES security level. We examine this implementation for speed and area using parallelisation methods and built-in FPGA resources. The results are compared against a method in current use, the Montgomery multiplier.
Citations: 5
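The Montgomery multiplier used as the comparison baseline can be sketched in software: operands are kept in Montgomery form a*R mod n, and the REDC step replaces division by n with shifts and masks, which is what makes it hardware-friendly. A minimal sketch assuming R = 2^k with odd n (the modular inverse via `pow(..., -1, ...)` needs Python 3.8+):

```python
# Montgomery multiplication with R = 2^k > n, n odd.

def montgomery_setup(n):
    k = n.bit_length()
    R = 1 << k
    n_inv = pow(-n, -1, R)            # -n^{-1} mod R
    return k, R, n_inv

def redc(T, n, k, R, n_inv):
    """Return T * R^{-1} mod n for T < n*R."""
    m = (T * n_inv) & (R - 1)         # mod R is just a mask
    t = (T + m * n) >> k              # T + m*n is divisible by R
    return t - n if t >= n else t

def mont_mul_mod(a, b, n):
    """Compute a*b mod n via Montgomery arithmetic."""
    k, R, n_inv = montgomery_setup(n)
    aR = (a * R) % n                  # convert operands to Montgomery form
    bR = (b * R) % n
    abR = redc(aR * bR, n, k, R, n_inv)   # = a*b*R mod n
    return redc(abR, n, k, R, n_inv)      # strip the remaining R factor
```

The MRCP approach the paper implements avoids even this reduction machinery by exploiting the special form of the modulus; the sketch above is only the baseline it is measured against.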