
Latest Publications: 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

When Spark Meets FPGAs: A Case Study for Next-Generation DNA Sequencing Acceleration
Yu-Ting Chen, J. Cong, Zhenman Fang, Jie Lei, Peng Wei
FPGA-enabled datacenters have shown great potential for improving performance and energy efficiency and have captured significant attention from both academia and industry. In this paper, we aim to answer one key question: how can we efficiently integrate FPGAs into state-of-the-art big-data computing frameworks? Although very important, this problem has not been well studied, especially for the integration of fine-grained FPGA accelerators that have short execution times but are invoked many times. To provide a generalized methodology and insight for efficient integration, we conduct an in-depth analysis of the challenges, and corresponding solutions, of integration at the single-thread, single-node multi-thread, and multi-node levels. With a step-by-step case study of a next-generation DNA sequencing application, we demonstrate how a straightforward integration with a 1000x slowdown can be tuned into an efficient integration with a 2.6x overall system speedup and a 2.4x energy efficiency improvement.
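The key integration lesson in the abstract, that fine-grained accelerators with short execution times should not be invoked once per record, can be illustrated with a small batching sketch. The PySpark snippet below is not code from the paper: `fpga_align_batch` is a hypothetical stand-in for an accelerator binding and the batch size is an assumed tuning parameter. It only shows how per-partition batching amortizes accelerator invocation overhead.

```python
# Illustrative sketch only -- not code from the paper. Instead of invoking a
# fine-grained accelerator once per record (paying the call overhead every
# time), records are grouped per partition and sent to the device in bulk.
from pyspark import SparkContext

BATCH = 4096  # assumed batch size; would be tuned to the accelerator


def fpga_align_batch(reads):
    """Hypothetical placeholder for a call into an FPGA alignment accelerator."""
    return [r[::-1] for r in reads]  # dummy work standing in for alignment


def align_partition(records):
    buf = []
    for rec in records:
        buf.append(rec)
        if len(buf) == BATCH:          # flush a full batch to the device
            for res in fpga_align_batch(buf):
                yield res
            buf = []
    if buf:                            # flush the tail batch
        for res in fpga_align_batch(buf):
            yield res


if __name__ == "__main__":
    sc = SparkContext("local[2]", "fpga-batching-sketch")
    reads = sc.parallelize(["ACGT" * 8] * 10000, numSlices=8)
    aligned = reads.mapPartitions(align_partition)  # one batched stream per partition
    print(aligned.count())
    sc.stop()
```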
{"title":"When Spark Meets FPGAs: A Case Study for Next-Generation DNA Sequencing Acceleration","authors":"Yu-Ting Chen, J. Cong, Zhenman Fang, Jie Lei, Peng Wei","doi":"10.1109/FCCM.2016.18","DOIUrl":"https://doi.org/10.1109/FCCM.2016.18","url":null,"abstract":"FPGA-enabled datacenters have shown great potential for providing performance and energy efficiency improvement, and captured a great amount of attention from both academia and industry. In this paper we aim to answer one key question: how can we efficiently integrate FPGAs into state-of-the-art big-data computing frameworks? Although very important, this problem has not been well studied, especially for the integration of fine-grained FPGA accelerators that have short execution time but will be invoked many times. To provide a generalized methodology and insight for efficient integration, we conduct an in-depth analysis of challenges and corresponding solutions of integration at single-thread, single-node multi-thread, and multi-node levels. With a step-by-step case study for the next-generation DNA sequencing application, we demonstrate how a straightforward integration with 1000x slowdown can be tuned into an efficient integration with 2.6x overall system speedup and 2.4x energy efficiency improvement.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126957043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 46
FPGA-Accelerated Particle-Grid Mapping
A. Sanaullah, Arash Khoshparvar, M. Herbordt
Computing the forces derived from long-range electrostatics is a critical application and also a central part of Molecular Dynamics. Part of that computation, the transformation of a charge grid to a potential grid via a 3D FFT, has received some attention recently and has been found to work extremely well on FPGAs. Here we report on the rest of the computation, which consists of two mappings: charges onto a grid and a potential grid onto the particles. These mappings are interesting in their own right as they are far more compute intensive than the FFTs; each is typically done using tricubic interpolation. We believe that these mappings have been studied only once previously for FPGAs and were then found to be exorbitantly expensive; i.e., only bicubic would fit on the chip. In the current work we find that, when using the Altera Arria 10, not only do both mappings fit, but so does an appropriately sized 3D FFT. This enables the building of a balanced accelerator for the entire long-range electrostatics computation on a single FPGA. This design scales directly to FPGA clusters. Other contributions include a new mapping scheme based on table lookup and a measure of the utility of the floating-point support of the Arria 10.
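As a rough illustration of why the particle-grid mappings dominate the arithmetic, the sketch below spreads point charges onto a 3D grid with separable cubic (order-4) B-spline weights, one common way such a tricubic mapping is realized in PME-style codes. This is an assumption made for illustration only, not the paper's table-lookup scheme or its FPGA pipeline; each charge touches a 4x4x4 neighbourhood, which is the source of the compute intensity noted above.

```python
# Illustrative charge-to-grid spreading with cubic B-spline weights.
import numpy as np


def bspline4_weights(f):
    """Cardinal cubic B-spline weights for a fractional offset f in [0, 1); they sum to 1."""
    return np.array([
        (1.0 - f) ** 3 / 6.0,
        (4.0 - 6.0 * f ** 2 + 3.0 * f ** 3) / 6.0,
        (1.0 + 3.0 * f + 3.0 * f ** 2 - 3.0 * f ** 3) / 6.0,
        f ** 3 / 6.0,
    ])


def spread_charges(positions, charges, grid_size):
    """Map point charges onto a periodic grid_size^3 charge grid."""
    grid = np.zeros((grid_size,) * 3)
    for (x, y, z), q in zip(positions, charges):
        base = np.floor([x, y, z]).astype(int)
        frac = np.array([x, y, z]) - base
        wx, wy, wz = (bspline4_weights(f) for f in frac)
        for i in range(4):              # 4x4x4 = 64 grid points per charge
            for j in range(4):
                for k in range(4):
                    gx = (base[0] - 1 + i) % grid_size
                    gy = (base[1] - 1 + j) % grid_size
                    gz = (base[2] - 1 + k) % grid_size
                    grid[gx, gy, gz] += q * wx[i] * wy[j] * wz[k]
    return grid


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pos = rng.uniform(0, 32, size=(100, 3))
    chg = rng.normal(size=100)
    g = spread_charges(pos, chg, 32)
    print(np.isclose(g.sum(), chg.sum()))  # the weights conserve total charge
```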
{"title":"FPGA-Accelerated Particle-Grid Mapping","authors":"A. Sanaullah, Arash Khoshparvar, M. Herbordt","doi":"10.1109/FCCM.2016.53","DOIUrl":"https://doi.org/10.1109/FCCM.2016.53","url":null,"abstract":"Computing the forces derived from long-range electrostatics is a critical application and also a central part of Molecular Dynamics. Part of that computation, the transformation of a charge grid to a potential grid via a 3D FFT, has received some attention recently and has been found to work extremely well on FPGAs. Here we report on the rest of the computation, which consists of two mappings: charges onto a grid and a potential grid onto the particles. These mappings are interesting in their own right as they are far more compute intensive than the FFTs; each is typically done using tricubic interpolation. We believe that these mappings have been studied only once previously for FPGAs and then found to be exorbitantly expensive; i.e., only bicubic would lit on the chip. In the current work we lind that, when using the Altera Arria 10, not only do both mappings lit, but also an appropriately sized 3D FFT. This enables the building of a balanced accelerator for the entire long-range electrostatics computation on a single FPGA. This design scales directly to FPGA clusters. Other contributions include a new mapping scheme based on table lookup and a measure of the utility of the floating point support of the Arria-10.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"131 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128076045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Bridging the Performance-Programmability Gap for FPGAs via OpenCL: A Case Study with OpenDwarfs
K. Krommydas, A. Helal, Anshuman Verma, Wu-chun Feng
For decades, the streaming architecture of FPGAs has delivered accelerated performance across many application domains, such as option pricing solvers in finance, computational fluid dynamics in oil and gas, and packet processing in network routers and firewalls. However, this performance has come at the significant expense of programmability, i.e., the performance-programmability gap. In particular, FPGA developers use a hardware design language (HDL) to implement the application data path and to design hardware modules for computation pipelines, memory management, synchronization, and communication. This process requires extensive low-level knowledge of the target FPGA architecture and consumes significant development time and effort. To address this lack of programmability of FPGAs, OpenCL provides an easy-to-use and portable programming model for CPUs, GPUs, APUs, and now, FPGAs. However, this significantly improved programmability can come at the expense of performance, that is, there still remains a performance-programmability gap. To improve the performance of OpenCL kernels on FPGAs, and thus, bridge the performance-programmability gap, we apply and evaluate the effect of various optimization techniques on GEM, an N-body method from the OpenDwarfs benchmark suite.
{"title":"Bridging the Performance-Programmability Gap for FPGAs via OpenCL: A Case Study with OpenDwarfs","authors":"K. Krommydas, A. Helal, Anshuman Verma, Wu-chun Feng","doi":"10.1109/FCCM.2016.56","DOIUrl":"https://doi.org/10.1109/FCCM.2016.56","url":null,"abstract":"For decades, the streaming architecture of FPGAs has delivered accelerated performance across many application domains, such as option pricing solvers in finance, computational fluid dynamics in oil and gas, and packet processing in network routers and firewalls. However, this performance has come at the significant expense of programmability, i.e., the performance-programmability gap. In particular, FPGA developers use a hardware design language (HDL) to implement the application data path and to design hardware modules for computation pipelines, memory management, synchronization, and communication. This process requires extensive low-level knowledge of the target FPGA architecture and consumes significant development time and effort. To address this lack of programmability of FPGAs, OpenCL provides an easy-to-use and portable programming model for CPUs, GPUs, APUs, and now, FPGAs. However, this significantly improved programmability can come at the expense of performance, that is, there still remains a performance-programmability gap. To improve the performance of OpenCL kernels on FPGAs, and thus, bridge the performance-programmability gap, we apply and evaluate the effect of various optimization techniques on GEM, an N-body method from the OpenDwarfs benchmark suite.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124284781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 17
A Multi-ported Memory Compiler Utilizing True Dual-Port BRAMs
Ameer Abdelhadi, G. Lemieux
Recent work has shown how multi-ported RAMs can be built out of dual-ported RAMs. Such techniques combine two structures: a set of "data banks" to hold the data, and a method for selecting the bank containing the last-written data, often called a live-value table (LVT). Most previous work has focused on the design of the LVT to reduce area and improve performance. In this paper, we instead reduce area by optimizing the design of the "data banks" portion. The optimization is embedded into a memory compiler that solves a set cover problem. When the set cover problem is solved optimally, the data banks use minimum area. Our technique applies to multi-ported RAMs that have a structural pattern we describe as "switched ports". Switched ports are a generalization of true ports, where a certain number of write ports can be dynamically switched into a possibly different number of read ports using one common read/write control signal. Furthermore, a given application may have multiple such sets, each with a different read/write control. While previous work generates multi-port RAM solutions that contain only true ports, or only simple ports, we contend that using only these two models is too limiting and prevents optimizations from being applied. Experimental results on 10 random instances of multi-port RAMs show a 17% average BRAM reduction compared to the best of the other approaches. The compiler and a fully parameterized Verilog implementation are released as an open-source library. The library has been extensively tested using Altera's EDA tools.
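The abstract states that the data-bank portion is sized by solving a set cover problem but does not spell out the formulation. The sketch below is therefore only a generic greedy set-cover heuristic with an invented port/bank encoding, meant to illustrate the kind of optimization a memory compiler can embed; the paper itself solves the problem optimally rather than heuristically.

```python
# Greedy set cover: pick bank configurations whose ports jointly cover every
# required port of the multi-ported RAM, preferring low cost per new port.
def greedy_set_cover(universe, candidates):
    """candidates: {name: (cost, covered_elements)}; returns chosen names."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        name, (cost, elems) = min(
            ((n, c) for n, c in candidates.items() if set(c[1]) & uncovered),
            key=lambda item: item[1][0] / len(set(item[1][1]) & uncovered),
        )
        chosen.append(name)
        uncovered -= set(elems)
    return chosen


if __name__ == "__main__":
    # Ports the compiled RAM must provide (invented example).
    ports = {"w0", "w1", "r0", "r1", "r2", "r3"}
    banks = {
        "bankA": (2, {"w0", "r0", "r1"}),   # (BRAM cost, ports it covers)
        "bankB": (2, {"w1", "r2", "r3"}),
        "bankC": (3, {"w0", "w1", "r0"}),
        "bankD": (1, {"r1", "r3"}),
    }
    print(greedy_set_cover(ports, banks))   # greedy choice, not necessarily minimal area
```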
{"title":"A Multi-ported Memory Compiler Utilizing True Dual-Port BRAMs","authors":"Ameer Abdelhadi, G. Lemieux","doi":"10.1109/FCCM.2016.45","DOIUrl":"https://doi.org/10.1109/FCCM.2016.45","url":null,"abstract":"Recent work has shown how multi-ported RAMs can be built out of dual-ported RAMs. Such techniques combine two structures: a set of \"data banks\" to hold the data, and a method for selecting the bank containing the last-written data, often called a live-value table (LVT). Most previous work has focused on the design of the LVT to reduce area and improve performance. In this paper, we instead reduce area by optimizing the design of the \"data banks\" portion. The optimization is embedded into a memory compiler that solves a set cover problem. When the set cover problem is solved optimally, the data banks use minimum area. Our technique applies to multi-ported RAMs that have a structural pattern we describe as \"switched ports\". Switched ports are a generalization of true ports, where a certain number of write ports can be dynamically switched into a possibly different number of read ports using one common read/write control signal. Furthermore, a given application may have multiple sets, each set with a different read/write control. While previous work generates multi-port RAM solutions that contain only true ports, or only simple ports, we contend that using only these two models is too limiting and prevents optimizations from being applied. Experimental results on 10 random instances of multi-port RAMs show 17% BRAM reduction on average compared to the best of other approaches. The compiler and a fully parameterized Verilog implementation is released as an open source library. The library has been extensively tested using Altera's EDA tools.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121885031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Tinker: Generating Custom Memory Architectures for Altera's OpenCL Compiler
D. Richmond, Jeremy Blackstone, Matthew Hogains, Kevin Thai, R. Kastner
Tools for C/C++-based hardware development have grown in popularity in recent years. However, the impact of these tools has been limited by their lack of support for integration with vendor IP, external memories, and communication peripherals. In this paper we introduce Tinker, an open-source Board Support Package generator for Altera's OpenCL Compiler. Board Support Packages define memory, communication, and IP ports for easy integration with high-level synthesis cores. Tinker abstracts the low-level hardware details of hardware development when creating board support packages and greatly increases the flexibility of OpenCL development. Tinker currently generates custom memory architectures from user specifications. We use our tool to generate a variety of architectures and apply them to two application kernels.
{"title":"Tinker: Generating Custom Memory Architectures for Altera's OpenCL Compiler","authors":"D. Richmond, Jeremy Blackstone, Matthew Hogains, Kevin Thai, R. Kastner","doi":"10.1109/FCCM.2016.13","DOIUrl":"https://doi.org/10.1109/FCCM.2016.13","url":null,"abstract":"Tools for C/C++ based-hardware development have grown in popularity in recent years. However, the impact of these tools has been limited by their lack of support for integration with vendor IP, external memories, and communication peripherals. In this paper we introduce Tinker, an open-source Board Support Package generator for Altera's OpenCL Compiler. Board Support Packages define memory, communication, and IP ports for easy integration with high level synthesis cores. Tinker abstracts the low-level hardware details of hardware development when creating board support packages and greatly increases the flexibility of OpenCL development. Tinker currently generates custom memory architectures from user specifications. We use our tool to generate a variety of architectures and apply them to two application kernels.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131365853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
ECO Based Placement and Routing Framework for 3D FPGAs with Micro-fluidic Cooling
Zhiyuan Yang, Caleb Serafy, Ankur Srivastava
Integrated micro-fluidic (MF) cooling is a promising technique to solve the thermal problems in 3D FPGAs [1] (as shown in Figure 1). However, this cooling method has some nonideal properties, such as non-uniform heat removal capacity along the flow direction. Existing 3D FPGA placement and routing (P&R) tools are unaware of micro-fluidic cooling, thus leading to large on-chip temperature variation, which is harmful to the reliability of 3D FPGAs. In this paper we demonstrate that we can incorporate micro-fluidic cooling considerations into existing 3D FPGA P&R tools simply by using a cooling-aware Engineering Change Order (ECO) based placement framework. Taking the placement result of an existing P&R tool, the framework modifies the node positions to improve the on-chip temperature uniformity, accounting for the fluidic cooling structures. Hence we do not need to invest in a standalone fluidic-cooling-aware 3D FPGA CAD framework.
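Purely as an illustration of a cooling-aware ECO objective (and not the paper's framework, which works inside an existing 3D FPGA P&R flow), the toy sketch below starts from a given placement, models each row's temperature as its power divided by a heat-removal capacity that falls along the coolant flow, and greedily swaps node pairs between rows whenever the swap shrinks the temperature spread. All powers and capacities are made up.

```python
# Toy cooling-aware ECO pass: accept only swaps that reduce the spread of
# estimated row temperatures, leaving the rest of the placement untouched.
import random


def row_temps(rows, capacity):
    return [sum(r) / c for r, c in zip(rows, capacity)]


def spread(temps):
    return max(temps) - min(temps)


def eco_swap_pass(rows, capacity, iterations=2000, seed=1):
    rng = random.Random(seed)
    rows = [list(r) for r in rows]          # work on a copy of the placement
    for _ in range(iterations):
        ra, rb = rng.sample(range(len(rows)), 2)
        ia, ib = rng.randrange(len(rows[ra])), rng.randrange(len(rows[rb]))
        before = spread(row_temps(rows, capacity))
        rows[ra][ia], rows[rb][ib] = rows[rb][ib], rows[ra][ia]
        if spread(row_temps(rows, capacity)) >= before:
            rows[ra][ia], rows[rb][ib] = rows[rb][ib], rows[ra][ia]  # undo non-improving swap
    return rows


if __name__ == "__main__":
    rng = random.Random(0)
    rows = [[rng.uniform(0.5, 2.0) for _ in range(16)] for _ in range(8)]  # node powers (W)
    capacity = [1.0 - 0.08 * i for i in range(8)]  # heat removal falls along the coolant flow
    print("spread before:", round(spread(row_temps(rows, capacity)), 3))
    cooled = eco_swap_pass(rows, capacity)
    print("spread after: ", round(spread(row_temps(cooled, capacity)), 3))
```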
{"title":"ECO Based Placement and Routing Framework for 3D FPGAs with Micro-fluidic Cooling","authors":"Zhiyuan Yang, Caleb Serafy, Ankur Srivastava","doi":"10.1109/FCCM.2016.57","DOIUrl":"https://doi.org/10.1109/FCCM.2016.57","url":null,"abstract":"Integrated micro-fluidic (MF) cooling is a promising technique to solve the thermal problems in 3D FPGAs [1] (As shown in Figure 1). However, this cooling method has some nonideal properties such as non-uniform heat removal capacity along the flow direction. Existing 3D FPGA placement and routing (P&R) tools are unaware of micro-fluidic cooling, thus leading to large on-chip temperature variation which is harmful to the reliability of 3D FPGAs. In this paper we demonstrate that we can incorporate micro-fluidic cooling considerations in existing 3D FPGA P&R tools simply with a cooling-aware Engineering Change Order (ECO) based placement framework. Taking the placement result of an existing P&R tool, the framework modifies the node positions to improve the on-chip temperature uniformity accounting for fluidic cooling structures. Hence we do not need to invest in a stand alone fluidic cooling aware 3D FPGA CAD framework.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130916923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
RP-Ring: A Heterogeneous Multi-FPGA Accelerating Solution for N-Body Simulations
Tianqi Wang, Xi Jin, Bo Peng, Chuanjun Wang, Linlin Zheng
We propose a heterogeneous multi-FPGA accelerating solution, called RP-Ring (Reconfigurable Processor ring), for direct-summation N-body simulation. In this solution, we use existing FPGA boards rather than designing new specialized boards, in order to reduce cost. The system can be expanded conveniently with any available FPGA board and requires only low communication bandwidth between FPGA boards. The communication protocol is simple and can be implemented with limited hardware/software resources. To prevent the slowest board from dragging down the overall performance, we build a mathematical model to decompose the workload among the FPGAs. The model divides the workload based on the logic resources, memory access bandwidth, and communication bandwidth of each FPGA chip. We apply the solution to an astrodynamics simulation and achieve two orders of magnitude speedup compared with CPU implementations.
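The abstract mentions a model that divides work according to each board's logic resources, memory bandwidth, and communication bandwidth, without giving its form. The hedged sketch below assumes a simple version: each board's pair-interaction rate is the minimum of a compute bound and a memory bound, particles are allocated in proportion to that rate, and the ring-link bandwidth requirement is checked afterwards. All board parameters and the 16-byte record size are invented for illustration.

```python
# Heterogeneity-aware split of a direct-summation N-body workload over a ring.
def split_particles(n_bodies, boards):
    rates = []
    for b in boards:
        compute = b["pipelines"] * b["clock_hz"]          # pair interactions / s
        memory = b["mem_bw_Bps"] / b["bytes_per_pair"]    # pair interactions / s
        rates.append(min(compute, memory))                # slower bound limits the board
    total = sum(rates)
    shares = [round(n_bodies * r / total) for r in rates]
    shares[-1] += n_bodies - sum(shares)                  # absorb rounding drift
    return shares, rates


if __name__ == "__main__":
    boards = [
        {"pipelines": 32, "clock_hz": 200e6, "mem_bw_Bps": 12.8e9, "bytes_per_pair": 4,
         "link_bw_Bps": 125e6},
        {"pipelines": 16, "clock_hz": 150e6, "mem_bw_Bps": 6.4e9, "bytes_per_pair": 4,
         "link_bw_Bps": 125e6},
        {"pipelines": 8, "clock_hz": 100e6, "mem_bw_Bps": 3.2e9, "bytes_per_pair": 4,
         "link_bw_Bps": 125e6},
    ]
    n = 1_000_000
    shares, rates = split_particles(n, boards)
    print("shares:", shares)
    # Each local body interacts with all n bodies; the slowest board sets the step time.
    step_time = max(s * n / r for s, r in zip(shares, rates))
    # Rough ring check: every 16-byte particle record passes each link once per step.
    need = 16 * n / step_time
    for i, b in enumerate(boards):
        print(f"board {i} link ok:", need <= b["link_bw_Bps"])
```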
{"title":"RP-Ring: A Heterogeneous Multi-FPGA Accelerating Solution for N-Body Simulations","authors":"Tianqi Wang, Xi Jin, Bo Peng, Chuanjun Wang, Linlin Zheng","doi":"10.1109/FCCM.2016.20","DOIUrl":"https://doi.org/10.1109/FCCM.2016.20","url":null,"abstract":"We propose an heterogeneous multi-FPGA accelerating solution, which is called as RP-ring (Reconfigurable Processor ring), for direct-summation N-body simulation. In this solution, we try to use existing FPGA boards rather than design new specialized boards to reduce cost. It can be expanded conveniently with any available FPGA board and only requires quite low communication bandwidth between FPGA boards. The communication protocol is simple and can be implemented with limited hardware/software resource. In order to prevent the slowest board from dragging the overall performance down, we build a mathematical model to decompose workload among FPGAs. The model divide workload based on the logic resource, memory access bandwidth and communication bandwidth of each FPGA chip. We apply the solution in astrodynamics simulation and achieve two orders of magnitude speedup compared with CPU implementations.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132252418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
High Performance Instruction Scheduling Circuits for Out-of-Order Soft Processors
Henry Wong, Vaughn Betz, Jonathan Rose
Soft processors have a role to play in easing the difficulty of designing applications into FPGAs for two reasons: first, they can be deployed only when needed, unlike permanent on-die hard processors. Second, for the portions of an application that can function sufficiently fast on a soft processor, it is far easier to write and debug single-threaded software code than to create hardware. The breadth of this second role increases when the performance of the soft processor increases, yet there has been little progress in the performance of soft processors since their commercial inception -- in particular, the sophisticated out-of-order superscalar approaches that arrived in the mid 1990s are not employed, despite the fact that their area cost is now easily tolerable. In this paper we take an important step towards out-of-order execution in soft processors by exploring instruction scheduling in an FPGA substrate. This differs from the hard-processor design problem because the logic substrate is restricted to LUTs, whereas hard processor scheduling circuits employ CAM and wired-OR structures to great benefit. We discuss both circuit and microarchitectural trade-offs, and compare three circuit structures for the scheduler, including a new structure called a fused-logic matrix scheduler. With this circuit, large schedulers up to 40 entries can be built with the same cycle time as the commercial Nios II/f soft processor (240 MHz). This careful design has the potential to significantly increase both the IPC and raw compute performance of a soft processor, compared to current commercial soft processors.
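To make the scheduler's job concrete, here is a behavioural Python model of a matrix-style wakeup/select scheduler: each entry keeps one pending bit per producer, a completing instruction's broadcast tag clears that column, and the oldest all-clear entry issues. This is only an illustration of the general structure being mapped to LUTs, not the paper's fused-logic matrix circuit.

```python
# Behavioural model of matrix-style wakeup/select (not RTL).
class MatrixScheduler:
    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.entries = []   # each entry: [age, dest_tag, set_of_pending_source_tags]
        self.age = 0

    def dispatch(self, dest_tag, source_tags, in_flight):
        if len(self.entries) >= self.num_entries:
            return False                          # scheduler full -> stall dispatch
        pending = {t for t in source_tags if t in in_flight}  # only in-flight sources wait
        self.entries.append([self.age, dest_tag, pending])
        self.age += 1
        return True

    def wakeup(self, completed_tag):
        for entry in self.entries:
            entry[2].discard(completed_tag)       # broadcast clears the matrix column

    def select(self):
        ready = [e for e in self.entries if not e[2]]
        if not ready:
            return None
        oldest = min(ready, key=lambda e: e[0])   # oldest-ready selection policy
        self.entries.remove(oldest)
        return oldest[1]


if __name__ == "__main__":
    sched = MatrixScheduler(num_entries=4)
    in_flight = {"r1", "r2"}
    sched.dispatch("r3", ["r1", "r2"], in_flight)  # waits on r1 and r2
    sched.dispatch("r4", ["r1"], in_flight)        # waits on r1 only
    print(sched.select())   # None: nothing ready yet
    sched.wakeup("r1")
    print(sched.select())   # r4 issues (r3 still waits on r2)
    sched.wakeup("r2")
    print(sched.select())   # r3 issues
```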
{"title":"High Performance Instruction Scheduling Circuits for Out-of-Order Soft Processors","authors":"Henry Wong, Vaughn Betz, Jonathan Rose","doi":"10.1109/FCCM.2016.11","DOIUrl":"https://doi.org/10.1109/FCCM.2016.11","url":null,"abstract":"Soft processors have a role to play in easing the difficulty of designing applications into FPGAs for two reasons: first, they can be deployed only when needed, unlike permanent on-die hard processors. Second, for the portions of an application that can function sufficiently fast on a soft processor, it is far easier to write and debug single-threaded software code than to create hardware. The breadth of this second role increases when the performance of the soft processor increases, yet there has been little progress in the performance of soft processors since their commercial inception -- in particular, the sophisticated out-of-order superscalar approaches that arrived in the mid 1990s are not employed, despite the fact that their area cost is now easily tolerable. In this paper we take an important step towards out-of-order execution in soft processors by exploring instruction scheduling in an FPGA substrate. This differs from the hard-processor design problem because the logic substrate is restricted to LUTs, whereas hard processor scheduling circuits employ CAM and wired-OR structures to great benefit. We discuss both circuit and microarchitectural trade-offs, and compare three circuit structures for the scheduler, including a new structure called a fused-logic matrix scheduler. With this circuit, large schedulers up to 40 entries can be built with the same cycle time as the commercial Nios II/f soft processor (240~MHz). This careful design has the potential to significantly increase both the IPC and raw compute performance of a soft processor, compared to current commercial soft processors.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133480221","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
KAPow: A System Identification Approach to Online Per-Module Power Estimation in FPGA Designs
Eddie Hung, James J. Davis, Joshua M. Levine, Edward A. Stott, P. Cheung, G. Constantinides
In a modern FPGA system-on-chip design, it is often insufficient to simply assess the total power consumption of the entire circuit by design-time estimation or runtime power rail measurement. Instead, to make better runtime decisions, it is desirable to understand the power consumed by each individual module in the system. In this work, we combine board-level power measurements with register-level activity counting to build an online model that produces a breakdown of power consumption within the design. Online model refinement avoids the need for a time-consuming characterisation stage and also allows the model to track long-term changes to operating conditions. Our flow is named KAPow, a (loose) acronym for 'K'ounting Activity for Power estimation, which we show to be accurate, with per-module power estimates within ±5 mW of true measurements, and to have low overheads. We also demonstrate an application example in which a per-module power breakdown can be used to determine an efficient mapping of tasks to modules and reduce system-wide power consumption by over 8%.
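The system-identification idea can be sketched in a few lines: model the measured board-level power as a static term plus a weighted sum of per-module activity counts, then refit the weights as samples accumulate. The snippet below uses a plain least-squares fit over synthetic data as a stand-in for KAPow's online update; the module count, coefficients, and noise level are all invented.

```python
# Per-module power breakdown by fitting total power against activity counts.
import numpy as np

rng = np.random.default_rng(42)
true_w = np.array([0.9, 0.35, 0.12, 0.05])    # hidden "truth": mW per activity count
static_power = 310.0                           # hidden static power, mW

samples_A, samples_p = [], []
for _ in range(200):
    activity = rng.integers(0, 1000, size=4)                       # per-module counters
    power = static_power + activity @ true_w + rng.normal(0, 2.0)  # noisy rail reading
    samples_A.append(np.append(activity, 1.0))                     # trailing 1 -> static term
    samples_p.append(power)

A = np.array(samples_A, dtype=float)
p = np.array(samples_p)
coeffs, *_ = np.linalg.lstsq(A, p, rcond=None)   # simple stand-in for an online update

print("estimated per-module mW/count:", np.round(coeffs[:-1], 3))
print("estimated static power (mW):  ", round(coeffs[-1], 1))
last_activity = A[-1, :-1]
print("last-sample breakdown (mW):   ", np.round(last_activity * coeffs[:-1], 1))
```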
{"title":"KAPow: A System Identification Approach to Online Per-Module Power Estimation in FPGA Designs","authors":"Eddie Hung, James J. Davis, Joshua M. Levine, Edward A. Stott, P. Cheung, G. Constantinides","doi":"10.1109/FCCM.2016.25","DOIUrl":"https://doi.org/10.1109/FCCM.2016.25","url":null,"abstract":"In a modern FPGA system-on-chip design, it is often insufficient to simply assess the total power consumption of the entire circuit by design-time estimation or runtime power rail measurement. Instead, to make better runtime decisions, it is desirable to understand the power consumed by each individual module in the system. In this work, we combine board-level power measurements with register-level activity counting to build an online model that produces a breakdown of power consumption within the design. Online model refinement avoids the need for a time-consuming characterisation stage and also allows the model to track long-term changes to operating conditions. Our flow is named KAPow, a (loose) acronym for 'K'ounting Activity for Power estimation, which we show to be accurate, with per-module power estimates as close to ±5mW of true measurements, and to have low overheads. We also demonstrate an application example in which a per-module power breakdown can be used to determine an efficient mapping of tasks to modules and reduce system-wide power consumption by over 8%.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123668500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Marathon: Statically-Scheduled Conflict-Free Routing on FPGA Overlay NoCs
Nachiket Kapre
We can improve the performance of deflection-routed FPGA overlay networks-on-chip (NoCs) like Hoplite by as much as 10× (random traffic) at the expense of modest extra storage cost when combining static scheduling with packet switching in an efficient, hybrid manner. Deflection-routed bufferless NoCs such as Hoplite allow extremely lightweight packet-switched routers on FPGAs, but suffer from high packet latencies due to deflections under congestion. When the communication workload is known in advance, time-multiplexed routing can offer a faster alternative by eliminating deflections, but it requires expensive storage of routing decisions in context buffers in LUT RAMs. In this paper, we propose a hybrid Marathon NoC that combines the low packet latencies of deflection-free time-multiplexed routing with the low implementation cost of the context-free packet-switched Hoplite NoC. The Marathon NoC requires a deterministic routing function to be implemented in the switch, along with time-stamped packet injection in the PEs, to ensure deflection-free routing in the network. The network also needs a one-time offline static scheduling stage that determines the appropriate time to inject each packet, guaranteeing a conflict-free, deflection-free route on the shared network. For random traffic patterns, Marathon outperforms Hoplite by as much as 10× and time multiplexing by as much as 1.2× when considering total communication time at identical area costs. For other synthetic patterns, Marathon outperforms Hoplite in all cases except the local pattern and is within 2-5× of the best time-multiplexing performance at large system sizes. For communication workloads extracted from real-world sparse matrix-vector multiplication kernels, Marathon outperforms both Hoplite and time multiplexing by 1.3-2.8×.
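As a simplified picture of the offline scheduling stage, the sketch below schedules packets on a one-dimensional unidirectional ring rather than a full Hoplite torus, and uses a plain greedy heuristic rather than the paper's scheduler: a packet injected at cycle t occupies one link per subsequent cycle along its route, and each packet gets the earliest injection cycle whose (link, cycle) slots are all free, so no deflection can occur at run time.

```python
# Greedy offline injection scheduling for conflict-free routing on a ring.
def schedule_ring(packets, num_nodes):
    """packets: list of (src, dst); returns one injection cycle per packet."""
    occupied = set()            # (link_index, cycle) slots already claimed
    injection_times = []
    for src, dst in packets:
        hops = (dst - src) % num_nodes or num_nodes   # full loop if src == dst
        t = 0
        while True:
            # Link (src + i) mod N is used during cycle t + i for hop i.
            slots = [((src + i) % num_nodes, t + i) for i in range(hops)]
            if all(s not in occupied for s in slots):
                occupied.update(slots)
                injection_times.append(t)
                break
            t += 1              # try the next candidate injection cycle
    return injection_times


if __name__ == "__main__":
    # Packets whose routes contend for link 1 in the first few cycles.
    packets = [(0, 3), (1, 3), (1, 2)]
    times = schedule_ring(packets, num_nodes=4)
    for (s, d), t in zip(packets, times):
        print(f"packet {s}->{d}: inject at cycle {t}")
```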
{"title":"Marathon: Statically-Scheduled Conflict-Free Routing on FPGA Overlay NoCs","authors":"Nachiket Kapre","doi":"10.1109/FCCM.2016.47","DOIUrl":"https://doi.org/10.1109/FCCM.2016.47","url":null,"abstract":"We can improve the performance of deflection-routed FPGA overlay networks-on-chip (NoCs) like Hoplite by as much as 10× (random traffic) at the expense of modest extra storage cost when combining static scheduling with packet switching in an efficient, hybrid manner. Deflection routed bufferless NoCs such as Hoplite, allow extremely lightweight packet switched routers on FPGAs, but suffer from high packet latencies due to deflections under congestion. When the communication workload is known in advance, time-multiplexed routing can offer a faster alternative by eliminating deflections but require expensive storage of routing decisions in context buffers in LUT RAMs. In this paper, we propose a hybrid Marathon NoC that combines the low packet latencies of deflection-free time-multiplexed routing with the low implementation cost of context-free packet-switched Hoplite NoC. The Marathon NoC requires a deterministic routing function to be implemented in the switch along with time-stamped packet injection in the PEs to ensure deflection-free routing in the network. The network also needs a one-time offline static scheduling stage that determines the appropriate time to inject a packet to guarantee conflict-free deflection-free route on the shared network. For random traffic patterns, Marathon outperforms Hoplite by as much as 10× and time multiplexing by as much as 1.2× when considering total communication time at identical area costs. For other synthetic patterns, Marathon outperforms Hoplite in all cases except local pattern and is within 2 - 5× of best time multiplexing performance at large system sizes. For communication workloads extracted from real-world sparse matrix-vector multiplication kernels, Marathon outperforms both Hoplite and Time Multiplexing by 1.3 - 2.8×.","PeriodicalId":113498,"journal":{"name":"2016 IEEE 24th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121577751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10