
Latest publications: 2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines

An Approach to a Fully Automated Partial Reconfiguration Design Flow
Kizheppatt Vipin, Suhaib A. Fahmy
Adoption of partial reconfiguration (PR) in mainstream FPGA system design remains underwhelming, primarily due to the significant FPGA design expertise required. We present an approach to fully automating a design flow that accepts a high-level description of a dynamically adaptive application and generates a fully functional, optimised PR design. This tool can determine the most suitable FPGA for a design to meet a given reconfiguration time constraint, and makes full use of available resources. The flow targets adaptive systems, where the dynamic behaviour and switching order are not known up front.
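As a back-of-envelope illustration of the reconfiguration-time constraint the abstract mentions (not the authors' tool), partial reconfiguration time is roughly the partial bitstream size divided by the configuration-port bandwidth; the 400 MB/s figure below (a 32-bit ICAP at 100 MHz) is an assumption, not from the paper:

```python
def reconfiguration_time_ms(region_bytes, port_bandwidth_bytes_per_s=400e6):
    # Rough model: time to load a partial bitstream over the configuration port.
    # 400 MB/s is an assumed figure (32-bit ICAP at 100 MHz), not the paper's value.
    return region_bytes / port_bandwidth_bytes_per_s * 1e3

# e.g. a 4 MB partial bitstream takes about 10 ms to load at 400 MB/s
```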
Citations: 2
An FPGA-Based Data Flow Engine for Gaussian Copula Model
Huabin Ruan, Xiaomeng Huang, H. Fu, Guangwen Yang, W. Luk, S. Racanière, O. Pell, Wenji Han
The Gaussian Copula Model (GCM) plays an important role in state-of-the-art financial analysis for modeling the dependence of financial assets. However, existing implementations of GCM are computationally demanding and time-consuming. In this paper, we propose a Dataflow Engine (DFE) design to accelerate GCM computation. Specifically, a commonly used CPU-friendly GCM algorithm is converted into a fully-pipelined dataflow graph through four optimization steps: recomposing the algorithm to be pipeline-friendly, removing unnecessary computation, sharing common computing results, and reducing the computing precision while maintaining the same level of accuracy in the results. The performance of the proposed DFE design is compared with three well-optimized CPU-based implementations. Experimental results show that our DFE solution not only generates fairly accurate results, but also achieves up to 467x speedup over a single-thread CPU-based solution, 120x speedup over a multi-thread CPU-based solution, and 47x speedup over an MPI-based solution.
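For readers unfamiliar with the model being accelerated, a minimal pure-Python sketch of sampling from a two-dimensional Gaussian copula (the mathematical core of GCM, not the paper's dataflow design; function name and parameters are illustrative):

```python
import math
import random

def gaussian_copula_pair(rho, n, seed=0):
    # Draw n pairs (u1, u2) with uniform marginals whose dependence follows a
    # Gaussian copula with correlation rho: correlate two standard normals,
    # then map each through the standard normal CDF.
    rng = random.Random(seed)
    phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # normal CDF
    out = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        out.append((phi(z1), phi(z2)))
    return out
```

The hardware design pipelines exactly this kind of per-sample arithmetic (CDF evaluation, multiply-accumulate) across the dataflow graph.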
Citations: 2
High-Level Description and Synthesis of Floating-Point Accumulators on FPGA
Marc-André Daigneault, J. David
Decades of research in the field of high-level hardware description now result in tools that are able to automatically transform C/C++ constructs into highly optimized parallel and pipelined architectures. Such approaches work fine when the control flow is known a priori, since the computation results in a large dataflow graph that can be mapped onto the available operators. Nevertheless, some applications have a control flow that is highly dependent on the data. This paper focuses on the hardware implementation of such applications and presents a high-level synthesis methodology applied to a Hardware Description Language (HDL) in which assignments correspond to self-synchronized connections between predefined data streaming sources and sinks. A data transfer occurs over an established connection when both source and sink are ready, according to their synchronization interfaces. Founded on a high-level communicating-FSM programming model, the language allows the user to describe and dynamically modify streaming architectures exploiting spatial and temporal parallelism. Our compiler attempts to maximize the number of transfers at each clock cycle and automatically fixes the potential combinatorial loops induced by the dynamic connection of dependent sources and sinks. The methodology is applied to the synthesis of a pipelined floating-point accumulator using the Delayed-Buffering (DB) reduction method. The results we obtain are similar to state-of-the-art dedicated architectures but require much less design time and expertise.
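The delayed-buffering idea can be sketched in software: a deeply pipelined floating-point adder cannot accept an operand that depends on a result still in flight, so the accumulator keeps one partial sum per pipeline stage and reduces them at the end. A simplified model (the real design operates on hardware streams; `latency=8` is an assumed adder depth, not the paper's):

```python
def pipelined_accumulate(values, latency=8):
    # One partial sum per adder pipeline stage: on "cycle" i the new value is
    # added to partial i % latency, whose previous result is already available,
    # so an addition can issue every cycle. A final reduction merges the partials.
    partials = [0.0] * latency
    for i, v in enumerate(values):
        partials[i % latency] += v
    return sum(partials)
```

Note the summation order differs from a sequential loop, so floating-point results can differ in the last bits; the paper's claim of "the same level of accuracy" refers to this kind of reordering.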
Citations: 3
Parallel Computation of Skyline Queries
L. Woods, G. Alonso, J. Teubner
Due to stagnant clock speeds and the high power consumption of commodity microprocessors, database vendors have started to explore massively parallel co-processors such as FPGAs to further increase performance. A typical approach is to push simple but compute-intensive operations (e.g., prefiltering, (de)compression) to FPGAs for acceleration. In this paper, we show how a significantly more complex operation, the computation of the skyline, can be holistically implemented on an FPGA. A skyline query computes the Pareto-optimal set of multi-dimensional data points. These queries have been studied extensively in software over the last decade, but this paper is the first to examine skyline computation in hardware. We propose a methodology that interleaves data storage and computation, allowing multiple operations to be executed on the same working set in parallel, while accounting for all data dependencies. Our experiments show that we achieve very promising results compared to CPU-based solutions.
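A reference definition of the skyline being computed (a naive O(n²) software sketch under a minimisation convention, not the paper's interleaved hardware method):

```python
def dominates(p, q):
    # p dominates q (minimisation) if p is no worse in every dimension
    # and strictly better in at least one.
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    # The skyline is the set of points dominated by no other point
    # (the Pareto-optimal set).
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```

The hardware design avoids this quadratic all-pairs scan by streaming the working set past parallel comparison units.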
Citations: 45
Reconfigurable Acceleration of Short Read Mapping
James Arram, K. H. Tsoi, W. Luk, P. Jiang
Recent improvements in the throughput of next-generation DNA sequencing machines pose a great computational challenge in analysing the massive quantities of data produced. This paper proposes a novel approach, based on reconfigurable computing technology, for accelerating short read mapping, where the positions of millions of short reads are located relative to a known reference sequence. Our approach consists of two key components: an exact string matcher for the bulk of the alignment process, and an approximate string matcher for the remaining cases. We characterise interesting regions of the design space, including homogeneous, heterogeneous and run-time reconfigurable designs, and provide back-of-envelope estimations of the corresponding performance. We show that a particular implementation of this architecture targeting a single FPGA can be up to 293 times faster than BWA on an Intel X5650 CPU, and 134 times faster than SOAP3 on an NVIDIA GTX 580 GPU.
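To illustrate the exact-matching component, here is a seed-and-verify sketch in software (the k-mer hash index is a hypothetical stand-in for the paper's hardware matcher, shown only to make the problem concrete):

```python
def build_index(reference, k=4):
    # Map every k-mer of the reference to the positions where it occurs.
    idx = {}
    for i in range(len(reference) - k + 1):
        idx.setdefault(reference[i:i + k], []).append(i)
    return idx

def map_read(read, reference, index, k=4):
    # Seed with the read's first k-mer, then verify the full read exactly
    # at each candidate position.
    return [pos for pos in index.get(read[:k], [])
            if reference[pos:pos + len(read)] == read]
```

Reads that fail exact matching (e.g. due to sequencing errors or variants) fall through to the approximate matcher in the proposed architecture.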
Citations: 40
An FPGA Based PCI-E Root Complex Architecture for Standalone SOPCs
Yingjie Cao, Yongxin Zhu, Xu Wang, Jiang Jiang, Meikang Qiu
We present an FPGA (field-programmable gate array) based PCI-E (PCI Express) root complex architecture for SOPCs (System-on-a-Programmable-Chip) in this paper. In our work, the system on the FPGA serves as a PCIe master device rather than a PCIe endpoint, the latter being the common practice for a co-processing device driven by a desktop computer or server. We use this system to control a PCIe endpoint, itself an FPGA-based endpoint implemented on another FPGA board. This architecture requires only free-of-charge IP cores. We also provide a basic software driver so that specific device drivers can be developed on top of it to control popular PCIe devices in the future, e.g. an Ethernet card or graphics card. The whole architecture has been implemented on Xilinx Virtex-6 FPGAs, indicating that it is a feasible approach to standalone SOPCs, with better efficiency than designs requiring additional generic controlling processors.
Citations: 3
Automating Elimination of Idle Functions by Run-Time Reconfiguration
Xinyu Niu, T. Chau, Qiwei Jin, W. Luk, Qiang Liu, O. Pell
A design approach is proposed to automatically identify and exploit run-time reconfiguration opportunities while optimising resource utilisation. We introduce the Reconfiguration Data Flow Graph, a hierarchical graph structure enabling reconfigurable designs to be synthesised in three steps: function analysis, configuration organisation, and run-time solution generation. Three applications, based on barrier option pricing, a particle filter, and reverse time migration, are used to evaluate the proposed approach. The run-time solutions approximate the theoretical performance by eliminating idle functions, and are 1.31 to 2.19 times faster than optimised static designs. FPGA designs developed with the proposed approach are up to 28.8 times faster than optimised CPU reference designs and 1.55 times faster than optimised GPU designs.
Citations: 19
A Multithreaded VLIW Soft Processor Family
Kalin Ovtcharov, Ilian Tili, J. Steffan
Summary form only given. There is growing commercial interest in using FPGAs for compute acceleration. To ease the programming task for non-hardware-expert programmers, systems are emerging that can map high-level languages such as C and OpenCL to FPGAs, targeting compiler-generated circuits and soft processing engines. Soft processing engines such as CPUs are familiar to programmers, can be reprogrammed quickly without rebuilding the FPGA image, and by their general nature can support multiple software functions in a smaller area than the alternative of multiple per-function synthesized circuits. Finally, compelling processing engines can be incorporated into the output of high-level synthesis systems. For FPGA-based soft compute engines to be compelling they must be computationally dense: they must achieve high throughput per area. For simple CPUs with simple functional units (FUs) it is relatively straightforward to achieve good utilization, and it is not overly detrimental if a small, single-pipeline-stage FU such as an integer adder is under-utilized. In contrast, larger, more deeply pipelined, more numerous, and more varied FUs can be quite challenging to keep busy, even for an engine capable of extracting instruction-level parallelism (ILP) from an application. Hence a key challenge for FPGA-based compute engines is how to maximize compute density (throughput per area) by achieving high utilization of a datapath composed of multiple varying FUs of significant and varying pipeline depth. In this work, we propose a highly parameterizable template architecture of a multi-threaded FPGA-based compute engine designed to highly utilize varied and deeply pipelined FUs. Our approach to achieving high utilization is to leverage (i) support for multiple thread contexts, (ii) thread-level and instruction-level parallelism, and (iii) static compiler analysis and scheduling. We focus on deeply pipelined, IEEE-754 floating-point FUs of widely varying latency, executing both Hodgkin-Huxley neuron simulation and Black-Scholes options pricing models as example applications, compiled with our LLVM-based scheduler. Targeting a Stratix IV FPGA, we explore architectural tradeoffs by measuring area and throughput for designs with varying numbers of FUs, thread contexts (T), memory banks (B), and bank multi-porting. To determine the most efficient designs that would be suitable for replicating, we measure compute density (application throughput per unit of FPGA area), and report which architectural choices lead to the most computationally dense designs. The most computationally dense design is not necessarily the one with highest throughput: (i) for maximizing throughput, having each thread reside in its own bank is best; (ii) when only moderate numbers of independent threads are available, the compute engine has higher compute density than a custom hardware implementation, e.g., 2.3x for 32 threads; (iii) the best FU mix does not necessarily match the FU usage in the application's data flow graph.
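One of the example workloads, Black-Scholes option pricing, has a closed form that shows why the FU mix is dominated by long-latency floating-point operations (log, exp, sqrt, division). A reference software implementation of the European call price (standard formula, not the paper's pipelined datapath):

```python
import math

def norm_cdf(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(spot, strike, t, rate, vol):
    # Black-Scholes price of a European call option
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol * vol) * t) / (vol * math.sqrt(t))
    d2 = d1 - vol * math.sqrt(t)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * t) * norm_cdf(d2)
```

Each price requires one log, one exp, two CDF evaluations, and several multiplies and divides, all mapping to FUs of very different pipeline depths, which is exactly the scheduling problem the proposed engine targets.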
Citations: 1
Image Segmentation Using Hardware Forest Classifiers
Richard Neil Pittman, A. Forin, A. Criminisi, J. Shotton, A. Mahram
Image segmentation is the process of partitioning an image into segments or subsets of pixels for purposes of further analysis, such as separating the interesting objects in the foreground from the uninteresting objects in the background. In many image processing applications, the process requires a sequence of computational steps on a per-pixel basis, thereby binding the performance to the size and resolution of the image. As applications require greater resolution and larger images, the computational demands of this step can quickly exceed the resources of available CPUs, especially in the power- and thermally-constrained areas of consumer electronics and mobile. In this work, we use a hardware tree-based classifier to solve the image segmentation problem. The application is background removal (BGR) from depth maps obtained from the Microsoft Kinect sensor. After the image is segmented, subsequent steps then classify the objects in the scene. The approach is flexible: to address different application domains we only need to change the trees used by the classifiers. We describe two distinct approaches and evaluate their performance using the commercial-grade testing environment used for the Microsoft Xbox gaming console.
DOI: 10.1109/FCCM.2013.20
Citations: 5
Open-Source Bitstream Generation
Ritesh Soni, Neil Steiner, M. French
This work presents an open-source bitstream generation tool for Torc. Bitstream generation has traditionally been the single part of the FPGA design flow that could not be openly reproduced, but our novel approach enables this without reverse-engineering or violating End-User License Agreement terms. We begin by creating a library of “micro-bitstreams”, a collection of primitives at a granularity of our choosing. These primitives can then be combined with simple merging operations to create larger designs, or portions thereof. Our effort is motivated by a desire to resume earlier work on embedded bitstream generation and autonomous hardware. This is not feasible with Xilinx bitgen because there is no reasonable way to run an x86 binary with complex library and data dependencies on most embedded systems. Initial support is limited to the Virtex-5, but we intend to extend this to other Xilinx architectures. We are able to support nearly all routing resources in the device, as well as the most common logic resources.
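The "simple merging operations" over micro-bitstreams can be sketched as a bitwise OR of the configuration words each primitive touches. The frame addresses and word contents below are invented for illustration; real Virtex-5 frames are vendor-defined and have fixed word counts:

```python
# Sketch of merging pre-built "micro-bitstreams" into a larger design.
# Each micro-bitstream is modeled as a sparse map {frame_address: [words]}.
# Addresses and word values are hypothetical, not real device data.

def merge_micro_bitstreams(micros):
    """OR together the frame words contributed by each primitive."""
    merged = {}
    for micro in micros:
        for addr, words in micro.items():
            if addr not in merged:
                merged[addr] = list(words)
            else:
                merged[addr] = [a | b for a, b in zip(merged[addr], words)]
    return merged

lut_bits = {0x0040: [0x000000F0, 0x0]}   # primitive configuring one LUT
route_bits = {0x0040: [0x0000000F, 0x0]} # primitive enabling one route
design = merge_micro_bitstreams([lut_bits, route_bits])
print(hex(design[0x0040][0]))  # → 0xff
```

The OR-merge works because each micro-bitstream only sets the bits of its own primitive; primitives that touch disjoint bits within the same frame compose without conflict.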
DOI: 10.1109/FCCM.2013.45
Citations: 21
Journal
2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines