
2014 IEEE 22nd Annual International Symposium on Field-Programmable Custom Computing Machines: Latest Papers

Reconstructing AES Key Schedules from Decayed Memory with FPGAs
Heinrich Riebler, Tobias Kenter, Christian Plessl, Christoph Sorge
In this paper, we study how AES key schedules can be reconstructed from decayed memory. This reconstruction is a crucial and time-consuming step when trying to break encryption systems with cold-boot attacks. In software, the reconstruction of the AES master key can be performed using a recursive, branch-and-bound tree-search algorithm that exploits redundancies in the key schedule to constrain the search space. In this work, we investigate how this branch-and-bound algorithm can be accelerated with FPGAs. We translate the recursive search procedure to a state machine with an explicit stack for each recursion level and create optimized datapaths that accelerate, in particular, the processing of the most frequently accessed tree levels. We support two different decay models, of which especially the more realistic non-idealized asymmetric decay model causes very high runtimes in software. Our implementation on a Maxeler dataflow computing system outperforms a software implementation for this model by up to 27x, which makes cold-boot attacks against AES practical even for high error rates.
DOI: https://doi.org/10.1109/FCCM.2014.67
Citations: 4
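The explicit-stack formulation mentioned in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `iterative_search`, the toy bit-vector problem, and the consistency rule (a bit observed as 1 must be 1 in the candidate, a simplified stand-in for one-directional decay) are assumptions made purely for illustration. The sketch only shows how a recursive branch-and-bound search can be driven by a loop that keeps one stack entry per tree level.

```python
def iterative_search(depth, choices, consistent):
    """Depth-first branch-and-bound without recursion.

    depth      -- number of tree levels (length of a full candidate)
    choices    -- values that can be tried at every level
    consistent -- predicate on a partial candidate; False prunes the subtree
    """
    solutions = []
    partial = []             # values chosen so far, one per level
    stack = [iter(choices)]  # one iterator per level, like a recursion frame
    while stack:
        try:
            value = next(stack[-1])
        except StopIteration:
            stack.pop()      # this level is exhausted: backtrack
            if partial:
                partial.pop()
            continue
        partial.append(value)
        if not consistent(partial):
            partial.pop()    # bound: discard the whole subtree
        elif len(partial) == depth:
            solutions.append(tuple(partial))
            partial.pop()
        else:
            stack.append(iter(choices))  # descend one level
    return solutions


# Toy observation (hypothetical): bits may have decayed from 1 to 0 only,
# so every bit still observed as 1 must be 1 in the candidate.
observed = (1, 0, 1)


def matches_observation(partial):
    return all(bit == 1 for bit, seen in zip(partial, observed) if seen == 1)


solutions = iterative_search(len(observed), (0, 1), matches_observation)
```

Each stack entry plays the role of a recursion frame, which is the property that lets the search be expressed as a state machine rather than as nested calls.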
A Fully-Pipelined FPGA Design for Tree-Reweighted Message Passing Algorithm
Wenlai Zhao, H. Fu, Guangwen Yang
A Markov random field (MRF) is a set of random variables demonstrating a Markov property in the form of an undirected graph. Maximum a posteriori probability (MAP) inference is a class of methods that seek solutions of problems modeled by MRF. MRF has been a very popular and powerful tool in computer vision problems such as stereo matching and image segmentation [1]. Finding the optimal solution of the MRF MAP problem is an NP-hard problem. Inference algorithms often involve a heavy computation load. Therefore, most related works have focused on improving the performance and efficiency of algorithms. Hardware-based acceleration is one of the most practical solutions.
DOI: https://doi.org/10.1109/FCCM.2014.59
Citations: 0
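To make the cost of exact MAP inference concrete, the following sketch enumerates every labeling of a tiny chain-structured MRF and returns the one with the lowest energy. It assumes the common energy formulation (sum of per-node unary costs plus pairwise costs on edges); the function name `map_brute_force`, the cost tables, and the Potts-style pairwise term are illustrative assumptions and are not taken from the paper, which concerns tree-reweighted message passing rather than enumeration.

```python
from itertools import product


def map_brute_force(unary, edges, pairwise):
    """Exhaustive MAP search: returns (minimum energy, labeling).

    unary    -- unary[i][l] is the cost of giving node i the label l
    edges    -- list of (i, j) node pairs
    pairwise -- function (label_i, label_j) -> cost on an edge
    """
    num_nodes = len(unary)
    num_labels = len(unary[0])
    best = None
    # The loop visits num_labels ** num_nodes labelings, which is why
    # enumeration is only feasible for very small models.
    for labels in product(range(num_labels), repeat=num_nodes):
        energy = sum(unary[i][labels[i]] for i in range(num_nodes))
        energy += sum(pairwise(labels[i], labels[j]) for i, j in edges)
        if best is None or energy < best[0]:
            best = (energy, labels)
    return best


def potts(a, b):
    # Penalize neighbouring nodes that take different labels.
    return 0 if a == b else 1


unary_costs = [[0, 2], [0, 1], [2, 0]]  # three nodes, two labels each
chain_edges = [(0, 1), (1, 2)]
best_energy, best_labels = map_brute_force(unary_costs, chain_edges, potts)
```

The exponential growth of the enumeration is the practical reason approximate inference algorithms, and hardware acceleration of them, are of interest.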
Flexibility and Circuit Overheads in Reconfigurable SIMD/MIMD Systems
Saad Arrabi, D. Moore, L. Wang, K. Skadron, B. Calhoun, J. Lach, B. Meyer
Dynamically reconfigurable SIMD/MIMD architectures made from simple cores have emerged to exploit diverse forms of parallelism in applications [1,2]. In this work, we investigate the circuit-level overhead and flexibility tradeoffs of such architectures through the design of a custom reconfigurable SIMD/MIMD system.
DOI: https://doi.org/10.1109/FCCM.2014.71
Citations: 1
Fast Design-Space Exploration Method for SW/HW Codesign on FPGAs
Yuki Ando, S. Shibata, S. Honda, H. Tomiyama, H. Takada
This paper presents an efficient design-space exploration method to identify the Pareto solution for the relation between the execution time and the hardware area. Initially, our method takes a particular system mapping that is surely in the Pareto solution, and then repeats the local search and the update of the Pareto solution until the Pareto solution reaches a steady state. Compared to genetic-algorithm-based methods, we found that our method outputs the Pareto solution with a smaller number of explorations for larger design spaces.
DOI: https://doi.org/10.1109/FCCM.2014.70
Citations: 0
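The notion of a Pareto solution over execution time and hardware area can be sketched as a simple dominance filter. This is only the definition of the Pareto set, not the local-search exploration method the abstract describes; the function name `pareto_front` and the sample design points are hypothetical, and both objectives are assumed to be minimized.

```python
def pareto_front(points):
    """Keep the (time, area) points that no other point dominates.

    A point q dominates p when q is no worse in both objectives
    and differs from p (both objectives are minimized).
    """
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Hypothetical mappings as (execution time, hardware area) pairs.
designs = [(10, 5), (8, 7), (12, 4), (9, 9), (8, 6)]
front = pareto_front(designs)
```

An exploration method of the kind described would call such a filter repeatedly while it evaluates neighbouring mappings, stopping once the front no longer changes.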
A Power-Efficient FPGA-Based Mixture-of-Gaussian (MoG) Background Subtraction for Full-HD Resolution
H. Tabkhi, Majid Sabbagh, G. Schirner
This short paper briefly describes an FPGA-based realization of MoG background subtraction operating at Full-HD frame resolution. Our hand-crafted hardware MoG consists of 77 pipeline stages operating at 148.5 MHz, implemented on a Zynq-7000 SoC. The result is very high efficiency, with a power consumption of less than 500 mW, which is 600x more efficient than an embedded software solution.
DOI: https://doi.org/10.1109/FCCM.2014.76
Citations: 9
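The 148.5 MHz figure coincides with the standard pixel clock for 1080p video at 60 frames per second once blanking intervals are counted. The abstract states neither the frame rate nor that one pixel is processed per clock cycle, so both are assumptions in the arithmetic below; the total raster of 2200 x 1125 is the standard 1080p60 timing.

```python
# Standard 1080p60 raster, including horizontal and vertical blanking.
H_TOTAL = 2200  # 1920 active pixels per line plus blanking
V_TOTAL = 1125  # 1080 active lines per frame plus blanking
FRAMES_PER_SECOND = 60  # assumed; not stated in the abstract

pixel_clock_hz = H_TOTAL * V_TOTAL * FRAMES_PER_SECOND

# Time for one pixel to traverse all 77 pipeline stages,
# assuming one stage per clock cycle.
PIPELINE_STAGES = 77
pipeline_latency_us = PIPELINE_STAGES / pixel_clock_hz * 1e6
```

Under these assumptions the pipeline depth adds roughly half a microsecond of latency, which is negligible next to the 16.7 ms frame period.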
A Grammar Induction Method for Clustering of Operations in Complex FPGA Designs
Muhsen Owaida, C. Antonopoulos, Nikolaos Bellas
In large-scale datapaths, complex interconnection requirements limit resource utilization and often dominate critical path delay. A variety of scheduling and binding algorithms have been proposed to reduce routing requirements by clustering frequently used sets of operations to avoid longer, inter-operational interconnects. In this paper we introduce a grammar induction approach for datapath synthesis. The proposed approach deals with the problem of routing using information at a higher level of abstraction, even before resource scheduling and binding. It is applied to a given data flow graph (DFG) and builds a compact form of the DFG by identifying and exploiting repetitive operation patterns with one or more outputs. Fully placed and routed circuits were successfully generated for complex designs that the standard manufacturer tool-chain failed to place and route without our method. Moreover, placement and routing time was reduced by 16% on average. Our grammar-based approach achieved a 12% reduction in area on average, mostly as a result of reducing multiplexer sizes and the number of flip-flops, without noticeable adverse effect on clock frequency. Our comparison with a state-of-the-art algorithm described in [8] shows that our approach outperforms it both in FPGA area reduction and in the time needed to place and route the design.
DOI: https://doi.org/10.1109/FCCM.2014.62
Citations: 1
Scheduling Mixed-Architecture Processes in Tightly Coupled FPGA-CPU Reconfigurable Computers
B. K. Hamilton, M. Inggs, Hayden Kwok-Hay So
The design and implementation of a multitasking run-time system on a tightly coupled FPGA-CPU platform is presented. Using a mix of CPU and FPGA programmable logic for computing, user applications are executed as mixed-architecture processes from the perspective of the OS. Context switching mechanisms with hybrid scheduling containing both blocking and preemption support were implemented to support concurrent execution of multiple mixed-architecture processes, and evaluated under a synthetic workload.
DOI: https://doi.org/10.1109/.73
Citations: 12
FPGA Accelerated Online Boosting for Multi-target Tracking
Matthew Jacobsen, Pingfan Meng, Siddarth Sampangi, R. Kastner
Robust real-time tracking of multiple targets is a requisite feature for many applications. Online boosting has become an effective approach for dealing with the variability in object appearance. This approach can adapt its classifier to changes in appearance at the cost of additional runtime computation. In this paper, we address the task of accelerating online boosting for multiple-target tracking. We propose an FPGA hardware-accelerated architecture to evaluate and train a boosted classifier in real time. A general-purpose, CPU-based, software-only implementation can track a single target at 17 frames per second (FPS). The FPGA-accelerated design is capable of tracking a single target at 1160 FPS or 57 independent targets at 30 FPS. This represents a 68× speedup over software.
DOI: https://doi.org/10.1109/FCCM.2014.50
Citations: 4
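The reported speedup can be checked directly from the frame rates quoted in the abstract. The variable names are illustrative, and the aggregate figure assumes that each of the 57 targets is evaluated once per frame.

```python
CPU_FPS_SINGLE_TARGET = 17
FPGA_FPS_SINGLE_TARGET = 1160
FPGA_TARGETS = 57
FPGA_FPS_MULTI_TARGET = 30

# Single-target speedup of the FPGA design over the software baseline.
speedup = FPGA_FPS_SINGLE_TARGET / CPU_FPS_SINGLE_TARGET

# Target evaluations per second in the multi-target configuration.
target_frames_per_second = FPGA_TARGETS * FPGA_FPS_MULTI_TARGET
```

The ratio 1160 / 17 is about 68.2, which agrees with the 68× figure given for the single-target case.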
Separation Logic-Assisted Code Transformations for Efficient High-Level Synthesis
F. Winterstein, Samuel Bayliss, G. Constantinides
The capabilities of modern FPGAs permit the mapping of increasingly complex applications into reconfigurable hardware. High-level synthesis (HLS) promises a significant shortening of the FPGA design cycle by raising the abstraction level of the design entry to high-level languages such as C/C++. Applications using dynamic, pointer-based data structures and dynamic memory allocation, however, remain difficult to implement well, yet such constructs are widely used in software. Automated optimizations that aim to leverage the increased memory bandwidth of FPGAs by distributing the application data over separate banks of on-chip memory are often ineffective in the presence of dynamic data structures, due to the lack of an automated analysis of pointer-based memory accesses. In this work, we take a step towards closing this gap. We present a static analysis for pointer-manipulating programs which automatically splits heap-allocated data structures into disjoint, independent regions. The analysis leverages recent advances in separation logic, a theoretical framework for reasoning about heap-allocated data which has been successfully applied in recent software verification tools. Our algorithm focuses on dynamic data structures accessed in loops and is accompanied by automated source-to-source transformations which enable automatic loop parallelization and memory partitioning by off-the-shelf HLS tools. We demonstrate successful loop parallelization and memory partitioning by our tool flow using three real-life applications which build, traverse, update and dispose of dynamically allocated data structures. Our case studies, comparing the automatically parallelized to the non-parallelized HLS implementations, show an average latency reduction by a factor of 2.5 across our benchmarks.
DOI: https://doi.org/10.1109/FCCM.2014.11
Citations: 14
Using Multi-op Instructions as a Way to Generate ASIPs with Optimized Pipeline Structure
Y. Ben-Asher, Irina Lipov, V. Tartakovsky, Dror Tiv
We propose automatic synthesis of application-specific instruction set processors (ASIPs). We use pipelined execution of multi-op machine instructions, e.g., *(reg1*reg2) = (*reg3) + (*reg4) (in C syntax), an instruction whose pipeline has three memory stages and two arithmetic stages. The problem is, for a given set of loops, to find a pipeline configuration and a multi-op ISA that maximize the IPC (instructions per cycle) while minimizing the resource usage and the cost of interconnections to the register file of the resulting CPU. The algorithm is based on finding an efficient cover of a large graph by a small set of convex sub-graphs g_i that are consistent with a given pipeline structure. Unlike in previous works, the g_i are not synthesized into circuits that are executed in a co-processor mode; rather, both the g_i and the rest of the program are executed by the same set of multi-op pipeline units. In this way we eliminate the overhead associated with the co-processor mode of regular ASIPs but maintain the high IPC values of these ASIPs. The main advantage of pipelined execution of multi-op instructions over VLIW instructions is shown to be the cost of interconnections between the CPU's execution units and the register file. Thus, we devise a grading function that, for each possible multi-op pipeline configuration, balances the expected IPC against the complexity of the interconnections. Using this grading function we show that in most cases the VLIW configuration is not the best choice.
DOI: https://doi.org/10.1109/FCCM.2014.16
Citations: 2