"Automating Optimization of Reconfigurable Designs"
Maciej Kurek, Tobias Becker, T. Chau, W. Luk (doi: 10.1109/FCCM.2014.65)
We present Automatic Reconfigurable Design Efficient Global Optimization (ARDEGO), a new algorithm based on the existing Efficient Global Optimization (EGO) methodology for automating the optimization of reconfigurable designs targeting Field-Programmable Gate Array (FPGA) technology. ARDEGO is a potentially disruptive design approach: instead of repeatedly improving a design by hand without understanding the design space as a whole, users follow an approach that (a) automates the manual optimization process, significantly reducing optimization time, and (b) requires no calibration of, or insight into, the algorithm's inner workings. We evaluate ARDEGO using two case studies: financial option pricing and seismic imaging.
"Customizable Compression Architecture for Efficient Configuration in CGRAs"
Syed M. A. H. Jafri, Muhammad Adeel Tajammul, M. Daneshtalab, A. Hemani, K. Paul, P. Ellervee, J. Plosila, H. Tenhunen (doi: 10.1109/FCCM.2014.18)
Today, Coarse-Grained Reconfigurable Architectures (CGRAs) host multiple applications, and novel CGRAs allow each application to exploit runtime parallelism and time sharing. Although these features enhance power and silicon efficiency, they significantly increase configuration memory overheads. As a solution to this problem, researchers have employed statistical compression, intermediate compact representation, and multicasting. Each of these techniques has different properties and is therefore best suited to a particular class of applications; however, existing research treats these methods only in isolation. In this paper we propose a morphable compression architecture that interleaves these techniques in a single platform.
"Automated Partial Reconfiguration Design for Adaptive Systems with CoPR for Zynq"
Kizheppatt Vipin, Suhaib A. Fahmy (doi: 10.1109/FCCM.2014.63)
Dynamically adaptive systems (DAS) respond to environmental conditions by modifying their processing at runtime and selecting alternative configurations of computation. Field-programmable gate arrays, with their support for partial reconfiguration (PR), represent an ideal platform for implementing such systems, but designing partially reconfigurable systems has traditionally been a difficult task requiring FPGA expertise. This paper presents a fully automated framework for implementing PR-based adaptive systems. The designer specifies a set of valid configurations containing instances of modules from a standard library; the tool automates the partitioning of modules into regions, the floorplanning of regions on the FPGA fabric, and the generation of bitstreams. A runtime system manages the loading of bitstreams automatically through API calls.
"Fast and Power Efficient Heapsort IP for Image Compression Application"
Yuhui Bai, S. Z. Ahmed, B. Granado (doi: 10.1109/FCCM.2014.72)
We present a hardware architecture for the heapsort algorithm as used in the subband coding block of a wavelet-based image coder, the Öktem image coder [1]. Although this coder provides good image quality, the sorting is time-consuming and application-specific: it is invoked repeatedly on different volumes of data during subband coding, so a simple hardware implementation with a fixed sorting capacity is difficult to scale at runtime. Both time/power efficiency and flexibility in sorting size must therefore be taken into account. We propose an improved FPGA heapsort architecture, based on Zabołotny's work [2], as an IP accelerator for the image coder. The architecture is configurable through adaptive layer-enable elements, so the sorting capacity can be adjusted at runtime to sort different amounts of data efficiently. With adaptive memory shutdown, our improved architecture reduces memory power by up to 20.9% compared to the baseline implementation, and it achieves a 13× speedup over an ARM Cortex-A9.
"A Scalable Multi-engine Xpress9 Compressor with Asynchronous Data Transfer"
Joo-Young Kim, S. Hauck, D. Burger (doi: 10.1109/FCCM.2014.49)
Data compression is crucial in large-scale storage servers to save both storage and network bandwidth, but it carries a high computational cost. In this work, we present a high-throughput FPGA-based compressor, deployed as a PCIe accelerator, that saves CPU resources and achieves high power efficiency. The proposed compressor differs from previous hardware compressors in the following ways: 1) it targets the Xpress9 algorithm, whose compression quality is comparable to the best Gzip implementation (level 9); 2) it uses a scalable multi-engine architecture with various IP blocks to handle algorithmic complexity while achieving high throughput; and 3) it supports a heavily multi-threaded server environment through an asynchronous data transfer interface between the host and the accelerator. The implemented Xpress9 compressor on an Altera Stratix V GS achieves 1.6-2.4 Gbps throughput with 7 engines across various compression benchmarks, supporting up to 128 thread contexts.
"Integrated CUDA-to-FPGA Synthesis with Network-on-Chip"
S. Gurumani, Jacob Tolar, Yao Chen, Yun Liang, K. Rupnow, Deming Chen (doi: 10.1109/FCCM.2014.14)
Data-parallel languages such as CUDA and OpenCL efficiently describe many parallel threads of computation, and HLS tools can effectively translate these descriptions into independent optimized cores. As the number of instantiated cores grows, average external memory access latency can become a significant factor in system performance. Although each core produces its outputs independently, the cores often heavily share input data. Exploiting on-chip data sharing both reduces external bandwidth demand and improves average memory access latency, allowing the system to improve performance with the same number of cores. In this paper, we develop a network-on-chip, coupled with computation cores synthesized from CUDA for FPGAs, that enables on-chip data sharing. We demonstrate reductions in external bandwidth demand of up to 60% (average 56%) and in total application latency in cycles of up to 43% (average 27%).
"A Self-Adaptive SEU Mitigation System for FPGAs with an Internal Block RAM Radiation Particle Sensor"
R. Glein, Bernhard Schmidt, F. Rittner, J. Teich, Daniel Ziener (doi: 10.1109/FCCM.2014.79)
In this paper, we propose a self-adaptive, FPGA-based, partially reconfigurable system for space missions that mitigates Single Event Upsets (SEUs) in the FPGA configuration and fabric. Dynamic reconfiguration is used for on-demand replication of modules in response to current and changing radiation levels. More precisely, the idea is to trigger a redundancy scheme such as Dual Modular Redundancy (DMR) or Triple Modular Redundancy (TMR) based on a continuously monitored SEU rate, measured inside the on-chip memories themselves, e.g., any subset of internal Block RAMs (including those in use). Depending on the current radiation level, the minimal number of replicas that still ensures the required Safety Integrity Level for a module is determined and configured at runtime. For signal processing applications, we show that this autonomous adaptation to different solar conditions yields resource-efficient mitigation. In our case study, the data throughput at the Solar Maximum condition (no flares) is tripled compared to a Triple Modular Redundancy implementation of a single module, and the probability of failure per hour under flare-enhanced conditions decreases by a factor of 2 × 10⁴ compared with a non-redundant system.
"Speeding Up FPGA Placement: Parallel Algorithms and Methods"
Ma An, J. Gregory Steffan, Vaughn Betz (doi: 10.1109/FCCM.2014.60)
Placement of a large FPGA design now commonly requires several hours, significantly hindering designer productivity. Furthermore, FPGA capacity is growing faster than CPU speed, which will further increase placement time unless new approaches are found. Multi-core processors are now ubiquitous, however, and some recent processors also have hardware support for transactional memory (TM), making parallelism an increasingly attractive approach to speeding up placement. We investigate methods to parallelize the simulated annealing placement algorithm in VPR, which is widely used in FPGA research. We explore both algorithmic changes and the use of different parallel programming paradigms and hardware, including TM, thread-level speculation (TLS), and lock-free techniques. We find that hardware TM enables large speedups (8.1× on average) but compromises "move fairness" and leads to an unacceptable quality loss. TLS scales poorly, with a maximum 2.2× speedup, but preserves quality. A new dependency-checking parallel strategy achieves the best balance: the deterministic version achieves a 5.9× speedup with no quality loss, while the non-deterministic, lock-free version can scale to a 34× speedup.
"FPGAs in the Cloud: Booting Virtualized Hardware Accelerators with OpenStack"
Stuart Byma, J. Steffan, H. Bannazadeh, Alberto Leon-Garcia, P. Chow (doi: 10.1109/FCCM.2014.42)
We present a new approach for integrating virtualized FPGA-based hardware accelerators into commercial-scale cloud computing systems, with minimal virtualization overhead. Partially reconfigurable regions across multiple FPGAs are offered as generic cloud resources through OpenStack (open-source cloud software), thereby allowing users to "boot" custom-designed or predefined network-connected hardware accelerators with the same commands they would use to boot a regular virtual machine. We propose a hardware and software framework to enable this virtualization. This is a first attempt at closely fitting FPGAs into existing cloud computing models, where resources are virtualized, flexible, and have the illusion of infinite scalability. Our system can set up and tear down virtual accelerators in approximately 2.6 seconds on average, much faster than regular virtual machines. The static virtualization hardware on the physical FPGAs causes only a three-cycle latency increase and a one-cycle pipeline stall per packet in accelerators, compared to a non-virtualized system. We present a case study analyzing the design and performance of an application-level load balancer using a fully implemented prototype of our system. Our study shows that FPGA cloud compute resources can easily outperform virtual machines, while the system's virtualization and abstraction significantly reduce design iteration time and design complexity.
"Look-up Table Design for Deep Sub-threshold through Full-Supply Operation"
M. Abusultan, S. Khatri (doi: 10.1109/FCCM.2014.80)
Field-programmable gate arrays (FPGAs) are the implementation platform of choice when design flexibility is paramount. However, the high power consumption of FPGAs, which arises from their flexible structure, makes them less appealing for extreme low-power applications. In this paper, we present a design for an FPGA lookup table (LUT) aimed at seamless operation over a wide band of supply voltages. The same LUT design can operate at sub-threshold voltage when low power is required and at higher voltages whenever faster performance is required. The results show that operating the LUT in sub-threshold mode yields ~80× lower power and ~4× lower energy than full-supply-voltage operation, for a 6-input LUT implemented in a 22nm predictive technology. The key drawback of sub-threshold operation is its susceptibility to process, temperature, and supply voltage (PVT) variations. This paper also presents the design and experimental results for a closed-loop adaptive body biasing mechanism that dynamically cancels global (spatial) as well as local (random) PVT variations. For the same 22nm technology, we demonstrate that the closed-loop adaptive body biasing circuits allow the FPGA LUT to operate over a frequency range spanning more than an order of magnitude (40 MHz to 1300 MHz). We also show that these circuits can cancel delay variations due to supply voltage changes, and reduce the effect of process variations on setup and hold times by 1.8× and 2.9×, respectively. The dynamic body biasing circuits incur a 3.49% area overhead when each is designed to drive a cluster of 25 LUTs.