
2013 International Conference on Field-Programmable Technology (FPT): Latest Publications

Implementation of high performance hardware architecture of OpenSURF algorithm on FPGA
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718346
Xitian Fan, Chen-Mie Wu, Wei Cao, Xuegong Zhou, Shengye Wang, Lingli Wang
This paper proposes a high-performance hardware architecture for the Speeded Up Robust Features (SURF) algorithm based on OpenSURF. To achieve a high processing frame rate, the architecture has three key characteristics. First, a sliding window method extracts feature points in parallel at selected scale levels, which greatly reduces the time spent on feature extraction. Second, a data reuse strategy in orientation generation and descriptor generation reduces the number of memory accesses, giving speedups of 3.87× and 2.25×, respectively. Third, the integral image is segmented and buffered in different memory blocks so that multiple data items can be accessed in one clock cycle, which further reduces the overall computation time. The architecture is implemented on an XC6VSX475T FPGA running at 156 MHz, and its maximum frame rate for VGA images reaches 356 frames per second (fps), which is 6.25 times the frame rate of OpenSURF running on a server with a Xeon 5650 processor and 6 times the reported frame rate of a recent implementation on three Virtex-4 FPGAs [8].
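SURF evaluates its box filters on an integral image, which is why the memory organisation of that image matters in the abstract above. The following is a minimal software sketch of an integral image and a constant-time box sum; it is not the paper's hardware design, and the names `integral_image` and `box_sum` are illustrative assumptions.

```python
def integral_image(img):
    """Return ii where ii[r][c] is the sum of img[0..r-1][0..c-1].

    The result has one extra row and column of zeros so that box sums
    need no boundary checks.
    """
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def box_sum(ii, top, left, height, width):
    """Sum of the height x width box whose top-left pixel is (top, left)."""
    bottom, right = top + height, left + width
    # Four lookups, independent of the box size.
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
total = box_sum(ii, 0, 0, 3, 3)         # whole image
lower_right = box_sum(ii, 1, 1, 2, 2)   # pixels 5, 6, 8, 9
```

Each box sum needs four reads at different addresses, which is presumably the kind of access pattern that splitting the integral image across several memory blocks is meant to serve in a single cycle.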
Citations: 16
COFFE: Fully-automated transistor sizing for FPGAs
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718327
Charles Chiasson, Vaughn Betz
In this paper, we present COFFE (Circuit Optimization For FPGA Exploration), a new fully-automated transistor sizing tool for FPGAs. Automated transistor-level CAD tools are an important part of the architecture exploration flow because they provide accurate area and delay estimates of low-level FPGA circuitry, which must be obtained for each architecture. We show that modeling transistors as linear resistances and capacitances, as has been done in previous FPGA transistor sizing tools, is highly inaccurate for fine-grained transistor-level design in advanced process nodes. Therefore, COFFE's transistor sizing algorithm maintains circuit non-linearities by relying exclusively on HSPICE simulations to measure delay. Area is estimated with a transistor size-based model that incorporates a number of improvements to enhance its accuracy in advanced process technologies compared with prior methods. In addition to more accurate area and delay estimation, COFFE considers more layout effects than prior published work by automatically accounting for transistor and wire loads, which are computed based on architectural parameters and layout area. This new FPGA transistor sizing tool requires only several hours to produce high-quality transistor sizing results for an entire FPGA tile, a task that would normally take months of manual effort. We demonstrate COFFE's utility in FPGA architecture studies by investigating an important new architectural question at the logic-to-routing interface.
Citations: 73
High-level synthesis of dynamic data structures: A case study using Vivado HLS
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718388
F. Winterstein, Samuel Bayliss, G. Constantinides
High-level synthesis promises a significant shortening of the FPGA design cycle when compared with design entry using register transfer level (RTL) languages. Recent evaluations report that C-to-RTL flows can produce results with a quality close to hand-crafted designs [1]. However, algorithms that use dynamic, pointer-based data structures, which are common in software, remain difficult to implement well. In this paper, we describe a comparative case study using Xilinx Vivado HLS as an exemplary state-of-the-art high-level synthesis tool. Our test cases are two alternative algorithms for the same compute-intensive machine learning technique (clustering) with significantly different computational properties. We compare a data-flow centric implementation to a recursive tree traversal implementation which incorporates complex data-dependent control flow and makes use of pointer-linked data structures and dynamic memory allocation. The outcome of this case study is twofold. For the first test case, we confirm similar performance between the hand-written and automatically generated RTL designs. The second case reveals a degradation in latency by a factor greater than 30× if the source code is not altered prior to high-level synthesis. We identify the reasons for this shortcoming and present code transformations that narrow the performance gap to a factor of four. We generalise our source-to-source transformations, whose automation motivates future research directions for improving high-level synthesis of dynamic data structures.
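To illustrate the kind of source-to-source change the abstract refers to, the sketch below rewrites a recursive traversal of a pointer-linked tree into an iterative traversal over a flat node pool with an explicit, bounded stack. This is a generic transformation, written in Python for brevity rather than in the C/C++ an HLS tool would consume, and it is not claimed to be the authors' exact rewrite; `Node`, `flatten`, `sum_recursive`, and `sum_pooled` are hypothetical names.

```python
class Node:
    """Pointer-linked tree node, as software code would typically allocate it."""
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right


def sum_recursive(node):
    """Original style: recursion over dynamically allocated, linked nodes."""
    if node is None:
        return 0
    return node.value + sum_recursive(node.left) + sum_recursive(node.right)


NIL = -1  # integer index standing in for a null pointer


def flatten(root):
    """Host-side helper: copy the tree into parallel arrays (a node pool)."""
    values, lefts, rights = [], [], []

    def add(node):
        if node is None:
            return NIL
        idx = len(values)
        values.append(node.value)
        lefts.append(NIL)
        rights.append(NIL)
        lefts[idx] = add(node.left)
        rights[idx] = add(node.right)
        return idx

    root_idx = add(root)
    return values, lefts, rights, root_idx


def sum_pooled(values, lefts, rights, root_idx):
    """Transformed style: array indexing and an explicit stack, no recursion."""
    stack = [0] * len(values)  # capacity fixed up front; each node is pushed once
    top = 0
    total = 0
    if root_idx != NIL:
        stack[top] = root_idx
        top += 1
    while top > 0:
        top -= 1
        idx = stack[top]
        total += values[idx]
        for child in (lefts[idx], rights[idx]):
            if child != NIL:
                stack[top] = child
                top += 1
    return total


tree = Node(1, Node(2, Node(4)), Node(3))
pool = flatten(tree)
```

Both versions compute the same result; the second uses only statically sized arrays and a loop, which is the form that synthesis tools generally handle more readily than recursion and heap allocation.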
Citations: 113
A non-intrusive portable fault injection framework to assess reliability of FPGA-based designs
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718397
Elyas Abolhassani Ghazaani, Zana Ghaderi, S. Miremadi
This paper proposes a full-featured fault injection framework to assess the reliability of FPGA-based designs. The framework provides non-intrusiveness, portability, flexibility, and performance in evaluating the reliability of FPGA-based designs against the adverse effects of single event upsets (SEUs). It works in a non-intrusive manner, allowing the reliability of ready-to-be-released designs to be assessed independently, without any intrusion into their place-and-route characteristics. We study the implications of a framework's intrusiveness on the design under test by comparing the proposed non-intrusive framework with previous intrusive methods; intrusive methods show up to 5% deviation in the number of effective faults. Because it is portable, the framework can be applied to a wide variety of FPGAs. Its flexibility comes from allowing the user to define the desired parameters for different fault injection strategies. Finally, the framework injects a fault, evaluates the design, and removes the fault in about 17 ms on average.
Citations: 5
Color configuration method for an optically reconfigurable gate array
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718400
Takumi Fujimori, Minoru Watanabe
This paper proposes a color configuration method for an optically reconfigurable gate array (ORGA). A conventional ORGA uses a single-wavelength laser array to address configuration contexts. The new ORGA, however, includes lasers of additional wavelengths in the laser array. Consequently, the number of addressable configuration contexts can be increased.
Citations: 0
Adaptive compression for instruction code of Coarse Grained Reconfigurable Architectures
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718396
Moo-Kyoung Chung, Jun-Kyoung Kim, Yeon-Gon Cho, Soojung Ryu
Coarse Grained Reconfigurable Architecture (CGRA) achieves high performance by exploiting instruction-level parallelism with software pipelining. A large instruction memory is, however, a critical problem for CGRAs, as it requires a large silicon area and high power consumption. Code compression is a promising technique to reduce memory area, bandwidth requirements, and power consumption. We present an adaptive code compression scheme for CGRA instructions based on dictionary-based compression, where the compression mode and dictionary contents are adaptively selected for each execution kernel and compression group. In addition, the scheme allows an efficient hardware decompressor with two-cycle latency and negligible silicon overhead. In experiments with well-optimized applications, the proposed method achieved an average compression ratio of 0.52 on a CGRA with a 16-functional-unit array.
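A minimal sketch of dictionary-based code compression may help in reading the 0.52 figure, assuming the usual code-compression definition of compression ratio as compressed size divided by original size (lower is better). The encoding used here, one flag bit followed by either a dictionary index or the raw instruction word, is an assumption for illustration; the paper's adaptive selection of mode and dictionary per kernel and compression group is not modelled, and all function names are hypothetical.

```python
from collections import Counter


def build_dictionary(words, dict_size):
    """Pick the dict_size most frequent instruction words."""
    return [w for w, _ in Counter(words).most_common(dict_size)]


def compress(words, dictionary, word_bits):
    """Encode each word as (1, index) if in the dictionary, else (0, raw word).

    Returns the symbol stream and its size in bits.
    """
    index_of = {w: i for i, w in enumerate(dictionary)}
    index_bits = max(1, (len(dictionary) - 1).bit_length())
    stream, total_bits = [], 0
    for w in words:
        if w in index_of:
            stream.append((1, index_of[w]))
            total_bits += 1 + index_bits
        else:
            stream.append((0, w))
            total_bits += 1 + word_bits
    return stream, total_bits


def decompress(stream, dictionary):
    """Inverse of compress: a table lookup or a pass-through per symbol."""
    return [dictionary[payload] if flag else payload for flag, payload in stream]


def compression_ratio(words, dict_size, word_bits):
    """Compressed size (including dictionary storage) over original size."""
    dictionary = build_dictionary(words, dict_size)
    _, compressed_bits = compress(words, dictionary, word_bits)
    dict_bits = len(dictionary) * word_bits
    return (compressed_bits + dict_bits) / (len(words) * word_bits)


# Toy instruction stream: two words dominate, one is rare.
program = [0xA] * 6 + [0xB] * 3 + [0xC]
dictionary = build_dictionary(program, 2)
stream, bits = compress(program, dictionary, word_bits=32)
ratio = compression_ratio(program, dict_size=2, word_bits=32)
```

Decompression in this scheme is a single table lookup per symbol, which gives some intuition for why a hardware decompressor can have low latency and small area.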
Citations: 5
Making domain-specific hardware synthesis tools cost-efficient
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718341
N. George, D. Novo, Tiark Rompf, Martin Odersky, P. Ienne
Tools to design hardware at a high level of abstraction promise software-like productivity for hardware designs. Among them, tools like Spiral, HDL Coder, Optimus and MMAlpha target specific application domains and produce highly efficient implementations from high-level input specifications in a Domain Specific Language (DSL). However, developing similar domain-specific High-Level Synthesis (HLS) tools requires enormous effort, which might offset their many advantages. In this paper, we propose a novel, cost-effective approach to develop domain-specific HLS tools. We develop the HLS tool by embedding its input DSL in Scala and using Lightweight Modular Staging (LMS), a compiler framework written in Scala, to perform optimizations at different abstraction levels. For example, to optimize computation on matrices, some optimizations are more effective when the program is represented at the level of matrices, while others are better applied at the level of individual matrix elements. To illustrate the proposed approach, we create an HLS flow to automatically generate efficient hardware implementations of matrix expressions described in our own high-level specification language. Although it is a simple example, it shows how easy it is to reuse modules across different HLS flows and to integrate our flow with existing tools like LegUp, a C-to-RTL compiler, and FloPoCo, an arithmetic core generator. The results reveal that our approach can simultaneously achieve high productivity and design quality with a very reasonable tool development effort.
Citations: 23
A hardware acceleration of a phylogenetic tree reconstruction with maximum parsimony algorithm using FPGA
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718376
Henry Block, T. Maruyama
In this paper, we present a hardware acceleration approach for phylogenetic tree reconstruction with the maximum parsimony algorithm using an FPGA. The algorithm is based on a stochastic local search with the progressive tree neighborhood. The hardware architecture is divided into different units, each of which performs a specific task of the algorithm, to take advantage of the parallel processing capabilities of the FPGA. We show results for four real-world biological datasets and compare them against results from two programs: our C++ implementation and TNT (a program for phylogenetic analysis). High acceleration rates are obtained against our C++ implementation, but not against TNT, which is even faster in some cases. We conclude our work with a discussion of this issue.
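The cost that a maximum parsimony search evaluates for each candidate tree can be sketched with Fitch's algorithm, the standard scoring method for unordered characters. The abstract does not say which scoring routine the hardware units implement, so this is background rather than the authors' design; the nested-tuple tree encoding and the name `fitch_score` are assumptions.

```python
def fitch_score(tree, sequences):
    """Parsimony score (minimum number of state changes) of a rooted binary tree.

    tree: a leaf name (str) or a pair (left_subtree, right_subtree).
    sequences: dict mapping leaf name -> aligned sequence of equal length.
    """
    def visit(node):
        # Returns (candidate state set per site, changes counted in this subtree).
        if isinstance(node, str):
            return [{ch} for ch in sequences[node]], 0
        left_sets, left_cost = visit(node[0])
        right_sets, right_cost = visit(node[1])
        sets, cost = [], left_cost + right_cost
        for a, b in zip(left_sets, right_sets):
            common = a & b
            if common:
                sets.append(common)      # children agree: no change needed
            else:
                sets.append(a | b)       # children disagree: one change
                cost += 1
        return sets, cost

    return visit(tree)[1]


seqs = {"A": "AC", "B": "AC", "C": "GC", "D": "GT"}
score_grouped = fitch_score((("A", "B"), ("C", "D")), seqs)
score_mixed = fitch_score((("A", "C"), ("B", "D")), seqs)
```

A local search such as the one described above repeatedly scores neighbouring trees and keeps those with a lower score; here the tree that groups the similar sequences scores 2, while the mixed grouping scores 3.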
Citations: 5
A speculative gather system for Cool Mega-Array
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718383
Rie Uno, N. Ozaki, Mai Izawa, Akihito Tsusaka, Takaaki Miyajima, H. Amano
Cool Mega Array (CMA) is a low-power reconfigurable processor array for battery-driven mobile devices. The prototype chip CMA-1 consists of an 8 × 8 PE (Processing Element) array and a micro-controller for controlling data alignment. Because the PE array of CMA is built as a combinational circuit, it has no signal that indicates when an operation in the PE array has completed. Instead, the propagation delay of the whole PE array, which corresponds to the operation time, is estimated from the data path and mapping information at the design stage of the application, and the timing for gathering the data is specified in the microcode of the controller. However, since this timing is fixed, it cannot handle variation in ambient temperature or voltage scaling of the PE array. Here, a speculative gather system is proposed which dynamically sets the timing for collecting operation results from the PE array. By collecting results twice and comparing them, it guarantees the correctness of the operation results and adjusts the gather timing automatically. The speculative gather system is implemented in the CMA, and evaluation results show that performance improves by 25.3% on average, with an overhead of 0.5% in area and 3.1% in power consumption.
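The gather-twice-and-compare idea can be sketched with a toy timing model. Everything below is a hypothetical illustration and not the CMA controller: `pe_array_output` stands in for a combinational array whose output keeps changing until an unknown settle time, and `speculative_gather` is one plausible policy that accepts a result once two consecutive samples agree and then tries a slightly earlier gather next time.

```python
def pe_array_output(t, settle_time, final_value):
    """Toy model: the output is unstable before settle_time, stable afterwards."""
    if t >= settle_time:
        return final_value
    # Deterministic stand-in for a still-changing value; differs every cycle.
    return final_value ^ (settle_time - t)


def speculative_gather(wait, settle_time, final_value, max_wait=64):
    """Sample at `wait`, then once per cycle until two consecutive samples agree.

    Returns (result, cycles_spent, next_wait).
    """
    t = wait
    first = pe_array_output(t, settle_time, final_value)
    while t < max_wait:
        t += 1
        second = pe_array_output(t, settle_time, final_value)
        if first == second:
            # Agreement: accept, and speculate one cycle earlier next time.
            return second, t, max(1, t - 2)
        first = second  # mismatch: the array had not settled yet
    raise RuntimeError("output did not stabilise within max_wait cycles")


# Start from a conservative fixed wait of 10 cycles; the array settles at 6.
wait, history = 10, []
for _ in range(10):
    result, cycles, wait = speculative_gather(wait, settle_time=6, final_value=0x5A)
    history.append((result, cycles))

# Conditions change (for example a higher temperature): it now settles at 9.
slow_result, slow_cycles, slow_wait = speculative_gather(wait, 9, 0x5A)
```

In this toy model the gather time shrinks from 11 cycles to 7 as the wait adapts, and it lengthens again automatically when the settle time increases, without ever returning an unsettled value. Real hardware could in principle see two identical unsettled samples, which this model deliberately excludes.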
Citations: 2
FPGA-accelerated key search for cold-boot attacks against AES
Pub Date : 2013-12-01 DOI: 10.1109/FPT.2013.6718394
Heinrich Riebler, Tobias Kenter, Christoph Sorge, Christian Plessl
Cold-boot attacks exploit the fact that DRAM contents are not immediately lost when a PC is powered off. Instead, the contents decay rather slowly, in particular if the DRAM chips are cooled to low temperatures. This effect opens an attack vector on cryptographic applications that keep decrypted keys in DRAM. An attacker with access to the target computer can reboot it or remove the RAM modules and quickly copy the RAM contents to non-volatile memory. By exploiting the known cryptographic structure of the cipher and the layout of the key data in memory (in our application, an AES key schedule with redundancy), the resulting memory image can be searched for sections that could correspond to decayed cryptographic keys; the attacker can then attempt to reconstruct the original key. However, the runtime of these algorithms grows rapidly with increasing memory image size, error rate, and complexity of the bit error model, which limits the practicability of the approach. In this work, we study how the algorithm for key search can be accelerated with custom computing machines. We present an FPGA-based architecture on a Maxeler dataflow computing system that outperforms a software implementation by up to 205x, which significantly improves the practicability of cold-boot attacks against AES.
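The redundancy the abstract refers to can be made concrete with a small software sketch. It assumes AES-128 and scores every 176-byte window of a memory image by how well each observed key-schedule word matches the word predicted from its observed neighbours. The function names, the Hamming-distance threshold `max_dist`, and the scoring rule are illustrative assumptions, not the authors' FPGA architecture or bit error model.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r


def rotl8(x: int, n: int) -> int:
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox() -> list:
    """AES S-box from its definition: field inverse, then affine transform."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gf_mul(x, y) == 1), 0)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]


def next_word(prev: bytes, back4: bytes, i: int) -> bytes:
    """Word i of an AES-128 key schedule from words i-1 and i-4."""
    if i % 4 == 0:
        rot = prev[1:] + prev[:1]                      # RotWord
        sub = bytes(SBOX[b] for b in rot)              # SubWord
        prev = bytes([sub[0] ^ RCON[i // 4 - 1]]) + sub[1:]
    return bytes(a ^ b for a, b in zip(prev, back4))


def expand_key(key: bytes) -> bytes:
    """Full 176-byte AES-128 key schedule for a 16-byte key."""
    words = [key[4 * i:4 * i + 4] for i in range(4)]
    for i in range(4, 44):
        words.append(next_word(words[i - 1], words[i - 4], i))
    return b"".join(words)


def schedule_distance(block: bytes) -> int:
    """Hamming distance between observed words and locally predicted words."""
    words = [block[4 * i:4 * i + 4] for i in range(44)]
    dist = 0
    for i in range(4, 44):
        pred = next_word(words[i - 1], words[i - 4], i)
        dist += sum(bin(a ^ b).count("1") for a, b in zip(pred, words[i]))
    return dist


def find_key_schedules(image: bytes, max_dist: int = 16) -> list:
    """Return (offset, distance) for windows that look like decayed schedules."""
    hits = []
    for off in range(len(image) - 175):
        dist = schedule_distance(image[off:off + 176])
        if dist <= max_dist:
            hits.append((off, dist))
    return hits
```

Because each window is scored independently, the scan is embarrassingly parallel, which is consistent with the abstract's motivation for hardware acceleration. A larger `max_dist` tolerates more decay at the cost of more candidate windows to check.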
Citations: 7