This paper proposes a high performance hardware architecture of Speeded Up Robust Features (SURF) algorithm based on OpenSURF. In order to achieve high processing frame rate, the hardware architecture is designed with several characteristics. Firstly, a sliding window method is proposed to extract feature points in parallel at selected scale levels. As a result, the time cost in feature extraction can be greatly reduced. Secondly, data reuse strategy is proposed in orientation generation and descriptor generation to reduce the memory access times. In this way, 3.87x and 2.25X speedup are achieved respectively. Thirdly, the integral image is segmented to buffer in different memory blocks in order to support multiple data accessing in one clock cycle, which will further reduce the whole calculating time of our implementation. The hardware architecture is implemented on an XC6VSX475T FPGA with 156 MHz and its maximal frame rate for VGA format image can reach 356 frames per second (fps), which is 6.25 times frame rate of OpenSURF running on a server with a Xeon 5650 processor, and 6 times the reported frame rate of the recent implementation on three Vritex4 FPGAs [8].
{"title":"Implementation of high performance hardware architecture of OpenSURF algorithm on FPGA","authors":"Xitian Fan, Chen-Mie Wu, Wei Cao, Xuegong Zhou, Shengye Wang, Lingli Wang","doi":"10.1109/FPT.2013.6718346","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718346","url":null,"abstract":"This paper proposes a high performance hardware architecture of Speeded Up Robust Features (SURF) algorithm based on OpenSURF. In order to achieve high processing frame rate, the hardware architecture is designed with several characteristics. Firstly, a sliding window method is proposed to extract feature points in parallel at selected scale levels. As a result, the time cost in feature extraction can be greatly reduced. Secondly, data reuse strategy is proposed in orientation generation and descriptor generation to reduce the memory access times. In this way, 3.87x and 2.25X speedup are achieved respectively. Thirdly, the integral image is segmented to buffer in different memory blocks in order to support multiple data accessing in one clock cycle, which will further reduce the whole calculating time of our implementation. The hardware architecture is implemented on an XC6VSX475T FPGA with 156 MHz and its maximal frame rate for VGA format image can reach 356 frames per second (fps), which is 6.25 times frame rate of OpenSURF running on a server with a Xeon 5650 processor, and 6 times the reported frame rate of the recent implementation on three Vritex4 FPGAs [8].","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129826625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2013-12-01DOI: 10.1109/FPT.2013.6718327
Charles Chiasson, Vaughn Betz
In this paper, we present COFFE (Circuit Optimization For FPGA Exploration), a new fully-automated transistor sizing tool for FPGAs. Automated transistor-level CAD tools are an important part of the architecture exploration flow because they provide accurate area and delay estimates of low-level FPGA circuitry, which must be obtained for each architecture. We show that modeling transistors as linear resistances and capacitances as has been done in previous FPGA transistor sizing tools is highly inaccurate for fine-grained transistor-level design in advanced process nodes. Therefore, COFFE's transistor sizing algorithm maintains circuit non-linearities by relying exclusively on HSPICE simulations to measure delay. Area is estimated with a transistor size-based model that incorporates a number of improvements to enhance its accuracy in advanced process technologies versus prior methods. In addition to more accurate area and delay estimation, COFFE considers more layout effects than prior published work by automatically accounting for transistor and wire loads, which are computed based on architectural parameters and layout area. This new FPGA transistor sizing tool requires only several hours to produce high-quality transistor sizing results for an entire FPGA tile; a task that would normally take months of manual effort. We demonstrate COFFE's utility in FPGA architecture studies by investigating an important new architectural question at the logic-to-routing interface.
在本文中,我们提出了COFFE (Circuit Optimization For FPGA Exploration),这是一种新的全自动FPGA晶体管尺寸工具。自动晶体管级CAD工具是架构探索流程的重要组成部分,因为它们提供了底层FPGA电路的精确面积和延迟估计,必须为每个架构获得。我们表明,将晶体管建模为线性电阻和电容,正如在以前的FPGA晶体管尺寸工具中所做的那样,对于高级工艺节点中的细粒度晶体管级设计是非常不准确的。因此,COFFE的晶体管尺寸算法通过完全依赖HSPICE模拟来测量延迟来保持电路非线性。面积估计是基于晶体管尺寸的模型,该模型结合了许多改进,以提高其在先进工艺技术中的准确性,而不是以前的方法。除了更精确的面积和延迟估计外,COFFE通过自动计算晶体管和电线负载(基于结构参数和布局面积计算),比先前发表的工作考虑了更多的布局效应。这种新的FPGA晶体管尺寸工具只需要几个小时就可以为整个FPGA瓦片产生高质量的晶体管尺寸结果;这项任务通常需要几个月的人工完成。我们通过研究逻辑到路由接口的一个重要的新架构问题来证明COFFE在FPGA架构研究中的实用性。
{"title":"COFFE: Fully-automated transistor sizing for FPGAs","authors":"Charles Chiasson, Vaughn Betz","doi":"10.1109/FPT.2013.6718327","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718327","url":null,"abstract":"In this paper, we present COFFE (Circuit Optimization For FPGA Exploration), a new fully-automated transistor sizing tool for FPGAs. Automated transistor-level CAD tools are an important part of the architecture exploration flow because they provide accurate area and delay estimates of low-level FPGA circuitry, which must be obtained for each architecture. We show that modeling transistors as linear resistances and capacitances as has been done in previous FPGA transistor sizing tools is highly inaccurate for fine-grained transistor-level design in advanced process nodes. Therefore, COFFE's transistor sizing algorithm maintains circuit non-linearities by relying exclusively on HSPICE simulations to measure delay. Area is estimated with a transistor size-based model that incorporates a number of improvements to enhance its accuracy in advanced process technologies versus prior methods. In addition to more accurate area and delay estimation, COFFE considers more layout effects than prior published work by automatically accounting for transistor and wire loads, which are computed based on architectural parameters and layout area. This new FPGA transistor sizing tool requires only several hours to produce high-quality transistor sizing results for an entire FPGA tile; a task that would normally take months of manual effort. We demonstrate COFFE's utility in FPGA architecture studies by investigating an important new architectural question at the logic-to-routing interface.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128502843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2013-12-01DOI: 10.1109/FPT.2013.6718388
F. Winterstein, Samuel Bayliss, G. Constantinides
High-level synthesis promises a significant shortening of the FPGA design cycle when compared with design entry using register transfer level (RTL) languages. Recent evaluations report that C-to-RTL flows can produce results with a quality close to hand-crafted designs [1]. Algorithms which use dynamic, pointer-based data structures, which are common in software, remain difficult to implement well. In this paper, we describe a comparative case study using Xilinx Vivado HLS as an exemplary state-of-the-art high-level synthesis tool. Our test cases are two alternative algorithms for the same compute-intensive machine learning technique (clustering) with significantly different computational properties. We compare a data-flow centric implementation to a recursive tree traversal implementation which incorporates complex data-dependent control flow and makes use of pointer-linked data structures and dynamic memory allocation. The outcome of this case study is twofold: We confirm similar performance between the hand-written and automatically generated RTL designs for the first test case. The second case reveals a degradation in latency by a factor greater than 30× if the source code is not altered prior to high-level synthesis. We identify the reasons for this shortcoming and present code transformations that narrow the performance gap to a factor of four. We generalise our source-to-source transformations whose automation motivates research directions to improve high-level synthesis of dynamic data structures in the future.
{"title":"High-level synthesis of dynamic data structures: A case study using Vivado HLS","authors":"F. Winterstein, Samuel Bayliss, G. Constantinides","doi":"10.1109/FPT.2013.6718388","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718388","url":null,"abstract":"High-level synthesis promises a significant shortening of the FPGA design cycle when compared with design entry using register transfer level (RTL) languages. Recent evaluations report that C-to-RTL flows can produce results with a quality close to hand-crafted designs [1]. Algorithms which use dynamic, pointer-based data structures, which are common in software, remain difficult to implement well. In this paper, we describe a comparative case study using Xilinx Vivado HLS as an exemplary state-of-the-art high-level synthesis tool. Our test cases are two alternative algorithms for the same compute-intensive machine learning technique (clustering) with significantly different computational properties. We compare a data-flow centric implementation to a recursive tree traversal implementation which incorporates complex data-dependent control flow and makes use of pointer-linked data structures and dynamic memory allocation. The outcome of this case study is twofold: We confirm similar performance between the hand-written and automatically generated RTL designs for the first test case. The second case reveals a degradation in latency by a factor greater than 30× if the source code is not altered prior to high-level synthesis. We identify the reasons for this shortcoming and present code transformations that narrow the performance gap to a factor of four. We generalise our source-to-source transformations whose automation motivates research directions to improve high-level synthesis of dynamic data structures in the future.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125628172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2013-12-01DOI: 10.1109/FPT.2013.6718397
Elyas Abolhassani Ghazaani, Zana Ghaderi, S. Miremadi
This paper proposes a full-featured fault injection framework to assess reliability of FPGA-based designs. The framework provides non-intrusiveness, portability, flexibility and performance in reliability evaluation of FPGA-based designs against adverse effects of SEUs. It works in a non-intrusive manner, allowing the reliability of ready-to-be-released designs to be assessed independently, without any intrusion into their place and route characteristics. We have studied implications of framework's intrusiveness into design under test by comparing proposed non-intrusive framework with previous intrusive methods; up to 5% deviation in the number of effective faults is observed in intrusive methods. Providing portability, the framework can be applied for a wide variety of FPGAs. Allowing the user to define desired parameters for different fault injection strategies confirms framework's flexibility. Finally, the framework performs the process of injecting faults, evaluating design and removing faults in about 17ms, on average.
{"title":"A non-intrusive portable fault injection framework to assess reliability of FPGA-based designs","authors":"Elyas Abolhassani Ghazaani, Zana Ghaderi, S. Miremadi","doi":"10.1109/FPT.2013.6718397","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718397","url":null,"abstract":"This paper proposes a full-featured fault injection framework to assess reliability of FPGA-based designs. The framework provides non-intrusiveness, portability, flexibility and performance in reliability evaluation of FPGA-based designs against adverse effects of SEUs. It works in a non-intrusive manner, allowing the reliability of ready-to-be-released designs to be assessed independently, without any intrusion into their place and route characteristics. We have studied implications of framework's intrusiveness into design under test by comparing proposed non-intrusive framework with previous intrusive methods; up to 5% deviation in the number of effective faults is observed in intrusive methods. Providing portability, the framework can be applied for a wide variety of FPGAs. Allowing the user to define desired parameters for different fault injection strategies confirms framework's flexibility. Finally, the framework performs the process of injecting faults, evaluating design and removing faults in about 17ms, on average.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114273718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2013-12-01DOI: 10.1109/FPT.2013.6718400
Takumi Fujimori, Minoru Watanabe
This paper presents a proposal of a color configuration method for an optically reconfigurable gate array (ORGA). A conventional ORGA consists of a single-wavelength laser array to address configuration contexts. However, the new ORGA has lasers of some other wavelength inside a laser array. Consequently, the addressable number of configuration contexts can be increased.
{"title":"Color configuration method for an optically reconfigurable gate array","authors":"Takumi Fujimori, Minoru Watanabe","doi":"10.1109/FPT.2013.6718400","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718400","url":null,"abstract":"This paper presents a proposal of a color configuration method for an optically reconfigurable gate array (ORGA). A conventional ORGA consists of a single-wavelength laser array to address configuration contexts. However, the new ORGA has lasers of some other wavelength inside a laser array. Consequently, the addressable number of configuration contexts can be increased.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133081479","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2013-12-01DOI: 10.1109/FPT.2013.6718396
Moo-Kyoung Chung, Jun-Kyoung Kim, Yeon-Gon Cho, Soojung Ryu
Coarse Grained Reconfigurable Architecture (CGRA) achieves high performance by exploiting instruction-level parallelism with software pipeline. Large instruction memory is, however, a critical problem of CGRA, which requires large silicon area and power consumption. Code compression is a promising technique to reduce the memory area, bandwidth requirements, and power consumption. We present an adaptive code compression scheme for CGRA instructions based on dictionary-based compression, where compression mode and dictionary contents are adaptively selected for each execution kernel and compression group. In addition, it is able to design hardware decompressor efficiently with two-cycle latency and negligible silicon overhead. The proposed method achieved an average compression ratio 0.52 in a CGRA of 16-functional unit array with the experiments of well-optimized applications.
{"title":"Adaptive compression for instruction code of Coarse Grained Reconfigurable Architectures","authors":"Moo-Kyoung Chung, Jun-Kyoung Kim, Yeon-Gon Cho, Soojung Ryu","doi":"10.1109/FPT.2013.6718396","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718396","url":null,"abstract":"Coarse Grained Reconfigurable Architecture (CGRA) achieves high performance by exploiting instruction-level parallelism with software pipeline. Large instruction memory is, however, a critical problem of CGRA, which requires large silicon area and power consumption. Code compression is a promising technique to reduce the memory area, bandwidth requirements, and power consumption. We present an adaptive code compression scheme for CGRA instructions based on dictionary-based compression, where compression mode and dictionary contents are adaptively selected for each execution kernel and compression group. In addition, it is able to design hardware decompressor efficiently with two-cycle latency and negligible silicon overhead. The proposed method achieved an average compression ratio 0.52 in a CGRA of 16-functional unit array with the experiments of well-optimized applications.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132860165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2013-12-01DOI: 10.1109/FPT.2013.6718341
N. George, D. Novo, Tiark Rompf, Martin Odersky, P. Ienne
Tools to design hardware at a high level of abstraction promise software-like productivity for hardware designs. Among them, tools like Spiral, HDL Coder, Optimus and MMAlpha target specific application domains and produce highly efficient implementations from high-level input specifications in a Domain Specific Language (DSL). But, developing similar domain-specific High-Level Synthesis (HLS) tools need enormous effort, which might offset their many advantages. In this paper, we propose a novel, cost-effective approach to develop domain-specific HLS tools. We develop the HLS tool by embedding its input DSL in Scala and using Lightweight Modular Staging (LMS), a compiler framework written in Scala, to perform optimizations at different abstraction levels. For example, to optimize computation on matrices, some optimizations are more effective when the program is represented at the level of matrices while others are better applied at the level of individual matrix elements. To illustrate the proposed approach, we create an HLS flow to automatically generate efficient hardware implementations of matrix expressions described in our own high-level specification language. Although a simple example, it shows how easy it is to reuse modules across different HLS flows and to integrate our flow with existing tools like LegUp, a C-to-RTL compiler, and FloPoCo, an arithmetic core generator. The results reveal that our approach can simultaneously achieve high productivity and design quality with a very reasonable tool development effort.
{"title":"Making domain-specific hardware synthesis tools cost-efficient","authors":"N. George, D. Novo, Tiark Rompf, Martin Odersky, P. Ienne","doi":"10.1109/FPT.2013.6718341","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718341","url":null,"abstract":"Tools to design hardware at a high level of abstraction promise software-like productivity for hardware designs. Among them, tools like Spiral, HDL Coder, Optimus and MMAlpha target specific application domains and produce highly efficient implementations from high-level input specifications in a Domain Specific Language (DSL). But, developing similar domain-specific High-Level Synthesis (HLS) tools need enormous effort, which might offset their many advantages. In this paper, we propose a novel, cost-effective approach to develop domain-specific HLS tools. We develop the HLS tool by embedding its input DSL in Scala and using Lightweight Modular Staging (LMS), a compiler framework written in Scala, to perform optimizations at different abstraction levels. For example, to optimize computation on matrices, some optimizations are more effective when the program is represented at the level of matrices while others are better applied at the level of individual matrix elements. To illustrate the proposed approach, we create an HLS flow to automatically generate efficient hardware implementations of matrix expressions described in our own high-level specification language. Although a simple example, it shows how easy it is to reuse modules across different HLS flows and to integrate our flow with existing tools like LegUp, a C-to-RTL compiler, and FloPoCo, an arithmetic core generator. The results reveal that our approach can simultaneously achieve high productivity and design quality with a very reasonable tool development effort.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122350382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2013-12-01DOI: 10.1109/FPT.2013.6718376
Henry Block, T. Maruyama
In this paper, we present a hardware acceleration approach for a phylogenetic tree reconstruction with maximum parsimony algorithm using FPGA. The algorithm is based on a stochastic local search with the progressive tree neighborhood. The hardware architecture is divided in different units, each of which performs a specific task of the algorithm, to take advantage of the parallel processing capabilities of the FPGA. We show results for four real-world biological datasets, and compare them against results from two programs: our C++ implementation and TNT (a program for phylogenetic analysis). High acceleration rates are obtained against our C++ implementation, but not against TNT, which even shows to be faster in some cases. We conclude our work with a discussion on this issue.
{"title":"A hardware acceleration of a phylogenetic tree reconstruction with maximum parsimony algorithm using FPGA","authors":"Henry Block, T. Maruyama","doi":"10.1109/FPT.2013.6718376","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718376","url":null,"abstract":"In this paper, we present a hardware acceleration approach for a phylogenetic tree reconstruction with maximum parsimony algorithm using FPGA. The algorithm is based on a stochastic local search with the progressive tree neighborhood. The hardware architecture is divided in different units, each of which performs a specific task of the algorithm, to take advantage of the parallel processing capabilities of the FPGA. We show results for four real-world biological datasets, and compare them against results from two programs: our C++ implementation and TNT (a program for phylogenetic analysis). High acceleration rates are obtained against our C++ implementation, but not against TNT, which even shows to be faster in some cases. We conclude our work with a discussion on this issue.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124599841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2013-12-01DOI: 10.1109/FPT.2013.6718383
Rie Uno, N. Ozaki, Mai Izawa, Akihito Tsusaka, Takaaki Miyajima, H. Amano
Cool Mega Array (CMA) is a low power reconfigurable processor array for battery driven mobile devices. A prototype chip CMA-1 consists of a 8 × 8 PE (Processing Element) array and a micro-controller for controlling data alignment. Because the PE array of CMA is built with a combinatorial circuit, it does not have a signal which tells that operation in the PE array was completed. A propagate delay of the whole PE array corresponding to the operation time was estimated by using the data path and mapping information in the design stage of the application. The timing information for gathering the data was specified in the microcode of the controller. However, since this timing is fixed, it cannot treat the variation of environment temperature and voltage scaling for the PE array. Here, a speculative gather system is proposed which sets the timing of collecting operation results from the PE array dynamically. By collecting results twice and comparing them, it guarantees the correctness of the operation results and adjusts the gather timing automatically. The speculative gather system is implemented in the CMA, and evaluation results appear that the performance is improved by 25.3% on average with the overhead of 0.5% in area and 3.1% in power consumption.
{"title":"A speculative gather system for Cool Mega-Array","authors":"Rie Uno, N. Ozaki, Mai Izawa, Akihito Tsusaka, Takaaki Miyajima, H. Amano","doi":"10.1109/FPT.2013.6718383","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718383","url":null,"abstract":"Cool Mega Array (CMA) is a low power reconfigurable processor array for battery driven mobile devices. A prototype chip CMA-1 consists of a 8 × 8 PE (Processing Element) array and a micro-controller for controlling data alignment. Because the PE array of CMA is built with a combinatorial circuit, it does not have a signal which tells that operation in the PE array was completed. A propagate delay of the whole PE array corresponding to the operation time was estimated by using the data path and mapping information in the design stage of the application. The timing information for gathering the data was specified in the microcode of the controller. However, since this timing is fixed, it cannot treat the variation of environment temperature and voltage scaling for the PE array. Here, a speculative gather system is proposed which sets the timing of collecting operation results from the PE array dynamically. By collecting results twice and comparing them, it guarantees the correctness of the operation results and adjusts the gather timing automatically. The speculative gather system is implemented in the CMA, and evaluation results appear that the performance is improved by 25.3% on average with the overhead of 0.5% in area and 3.1% in power consumption.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115973339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2013-12-01DOI: 10.1109/FPT.2013.6718394
Heinrich Riebler, Tobias Kenter, Christoph Sorge, Christian Plessl
Cold-boot attacks exploit the fact that DRAM contents are not immediately lost when a PC is powered off. Instead the contents decay rather slowly, in particular if the DRAM chips are cooled to low temperatures. This effect opens an attack vector on cryptographic applications that keep decrypted keys in DRAM. An attacker with access to the target computer can reboot it or remove the RAM modules and quickly copy the RAM contents to non-volatile memory. By exploiting the known cryptographic structure of the cipher and layout of the key data in memory, in our application an AES key schedule with redundancy, the resulting memory image can be searched for sections that could correspond to decayed cryptographic keys; then, the attacker can attempt to reconstruct the original key. However, the runtime of these algorithms grows rapidly with increasing memory image size, error rate and complexity of the bit error model, which limits the practicability of the approach. In this work, we study how the algorithm for key search can be accelerated with custom computing machines. We present an FPGA-based architecture on a Maxeler dataflow computing system that outperforms a software implementation up to 205x, which significantly improves the practicability of cold-attacks against AES.
{"title":"FPGA-accelerated key search for cold-boot attacks against AES","authors":"Heinrich Riebler, Tobias Kenter, Christoph Sorge, Christian Plessl","doi":"10.1109/FPT.2013.6718394","DOIUrl":"https://doi.org/10.1109/FPT.2013.6718394","url":null,"abstract":"Cold-boot attacks exploit the fact that DRAM contents are not immediately lost when a PC is powered off. Instead the contents decay rather slowly, in particular if the DRAM chips are cooled to low temperatures. This effect opens an attack vector on cryptographic applications that keep decrypted keys in DRAM. An attacker with access to the target computer can reboot it or remove the RAM modules and quickly copy the RAM contents to non-volatile memory. By exploiting the known cryptographic structure of the cipher and layout of the key data in memory, in our application an AES key schedule with redundancy, the resulting memory image can be searched for sections that could correspond to decayed cryptographic keys; then, the attacker can attempt to reconstruct the original key. However, the runtime of these algorithms grows rapidly with increasing memory image size, error rate and complexity of the bit error model, which limits the practicability of the approach. In this work, we study how the algorithm for key search can be accelerated with custom computing machines. We present an FPGA-based architecture on a Maxeler dataflow computing system that outperforms a software implementation up to 205x, which significantly improves the practicability of cold-attacks against AES.","PeriodicalId":344469,"journal":{"name":"2013 International Conference on Field-Programmable Technology (FPT)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121933668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}