"Automating Optimization of Reconfigurable Designs"
Maciej Kurek, Tobias Becker, T. Chau, W. Luk (doi: 10.1109/FCCM.2014.65)
We present Automatic Reconfigurable Design Efficient Global Optimization (ARDEGO), a new algorithm based on the existing Efficient Global Optimization (EGO) methodology for automating the optimization of reconfigurable designs targeting Field-Programmable Gate Array (FPGA) technology. ARDEGO is a potentially disruptive design approach: instead of repeatedly improving a design by hand without understanding the design space as a whole, users follow an approach that (a) automates the manual optimization process, significantly reducing optimization time, and (b) requires no calibration of, or insight into, the algorithm's inner workings. We evaluate ARDEGO using two case studies: financial option pricing and seismic imaging.
"Customizable Compression Architecture for Efficient Configuration in CGRAs"
Syed M. A. H. Jafri, Muhammad Adeel Tajammul, M. Daneshtalab, A. Hemani, K. Paul, P. Ellervee, J. Plosila, H. Tenhunen (doi: 10.1109/FCCM.2014.18)
Today, Coarse-Grained Reconfigurable Architectures (CGRAs) host multiple applications, and novel CGRAs allow each application to exploit runtime parallelism and time sharing. Although these features enhance power and silicon efficiency, they significantly increase configuration memory overheads. As a solution to this problem, researchers have employed statistical compression, intermediate compact representation, and multicasting. Each of these techniques has different properties and is therefore best suited to a particular class of applications; however, existing research treats these methods only in isolation. In this paper we propose a morphable compression architecture that interleaves these techniques in a single platform.
"Automated Partial Reconfiguration Design for Adaptive Systems with CoPR for Zynq"
Kizheppatt Vipin, Suhaib A. Fahmy (doi: 10.1109/FCCM.2014.63)
Dynamically adaptive systems (DAS) respond to environmental conditions by modifying their processing at runtime and selecting alternative configurations of computation. Field-programmable gate arrays, with their support for partial reconfiguration (PR), represent an ideal platform for implementing such systems, but designing partially reconfigurable systems has traditionally been a difficult task requiring FPGA expertise. This paper presents a fully automated framework for implementing PR-based adaptive systems. The designer specifies a set of valid configurations containing instances of modules from a standard library; the tool automates the partitioning of modules into regions, the floorplanning of regions on the FPGA fabric, and the generation of bitstreams. A runtime system manages the loading of bitstreams automatically through API calls.
"Fast and Power Efficient Heapsort IP for Image Compression Application"
Yuhui Bai, S. Z. Ahmed, B. Granado (doi: 10.1109/FCCM.2014.72)
We present a hardware architecture for the heapsort algorithm as used in the subband coding block of a wavelet-based image coder, the Öktem image coder [1]. Although this coder provides good image quality, the sorting is time-consuming and application-specific: it is invoked repeatedly on different volumes of data during subband coding, so a simple hardware implementation with a fixed sorting capacity is difficult to scale at runtime. Both time/power efficiency and flexibility in sorting size must therefore be taken into account. We propose an improved FPGA heapsort architecture, based on Zabołotny's work [2], as an IP accelerator for the image coder. The architecture is configurable through adaptive layer-enable elements, so the sorting capacity can be adjusted at runtime to sort different amounts of data efficiently. With adaptive memory shutdown, our improved architecture reduces memory power by up to 20.9% compared to the baseline implementation, and it achieves a 13× speedup over an ARM Cortex-A9.
"A Scalable Multi-engine Xpress9 Compressor with Asynchronous Data Transfer"
Joo-Young Kim, S. Hauck, D. Burger (doi: 10.1109/FCCM.2014.49)
Data compression is crucial in large-scale storage servers to save both storage and network bandwidth, but it carries a high computational cost. In this work, we present a high-throughput FPGA-based compressor, deployed as a PCIe accelerator, that saves CPU resources and achieves high power efficiency. The proposed compressor differs from previous hardware compressors in the following ways: 1) it targets the Xpress9 algorithm, whose compression quality is comparable to the best Gzip implementation (level 9); 2) it uses a scalable multi-engine architecture with various IP blocks to handle algorithmic complexity while achieving high throughput; and 3) it supports a heavily multi-threaded server environment through an asynchronous data transfer interface between the host and the accelerator. The implemented Xpress9 compressor on an Altera Stratix V GS achieves 1.6-2.4 Gbps throughput with 7 engines across various compression benchmarks, supporting up to 128 thread contexts.
"Integrated CUDA-to-FPGA Synthesis with Network-on-Chip"
S. Gurumani, Jacob Tolar, Yao Chen, Yun Liang, K. Rupnow, Deming Chen (doi: 10.1109/FCCM.2014.14)
Data-parallel languages such as CUDA and OpenCL efficiently describe many parallel threads of computation, and HLS tools can effectively translate these descriptions into independent optimized cores. As the number of instantiated cores grows, average external memory access latency can become a significant factor in system performance. Although each core produces its outputs independently, the cores often heavily share input data. Exploiting on-chip data sharing both reduces external bandwidth demand and improves average memory access latency, allowing the system to improve performance with the same number of cores. In this paper, we develop a network-on-chip, coupled with computation cores synthesized from CUDA for FPGAs, that enables on-chip data sharing. We demonstrate reductions in external bandwidth demand of up to 60% (average 56%) and in total application latency in cycles of up to 43% (average 27%).
"A Self-Adaptive SEU Mitigation System for FPGAs with an Internal Block RAM Radiation Particle Sensor"
R. Glein, Bernhard Schmidt, F. Rittner, J. Teich, Daniel Ziener (doi: 10.1109/FCCM.2014.79)
In this paper, we propose a self-adaptive, FPGA-based, partially reconfigurable system for space missions that mitigates Single Event Upsets (SEUs) in the FPGA configuration and fabric. Dynamic reconfiguration is used for on-demand replication of modules in response to current and changing radiation levels. More precisely, the idea is to trigger a redundancy scheme such as Dual Modular Redundancy (DMR) or Triple Modular Redundancy (TMR) based on a continuously monitored SEU rate, measured inside the on-chip memories themselves, e.g., any subset of internal Block RAMs (including those in use). Depending on the current radiation level, the minimal number of replicas that still ensures the required Safety Integrity Level for a module is determined and configured at runtime. For signal processing applications, we show that this autonomous adaptation to different solar conditions yields resource-efficient mitigation. In our case study, the data throughput at the Solar Maximum condition (no flares) is tripled compared to a Triple Modular Redundancy implementation of a single module, and the probability of failure per hour under flare-enhanced conditions decreases by a factor of 2 × 10⁴ compared with a non-redundant system.
"Speeding Up FPGA Placement: Parallel Algorithms and Methods"
Ma An, J. Gregory Steffan, Vaughn Betz (doi: 10.1109/FCCM.2014.60)
Placement of a large FPGA design now commonly requires several hours, significantly hindering designer productivity. Furthermore, FPGA capacity is growing faster than CPU speed, which will further increase placement time unless new approaches are found. Multi-core processors are now ubiquitous, however, and some recent processors also have hardware support for transactional memory (TM), making parallelism an increasingly attractive approach to speeding up placement. We investigate methods to parallelize the simulated annealing placement algorithm in VPR, which is widely used in FPGA research. We explore both algorithmic changes and the use of different parallel programming paradigms and hardware, including TM, thread-level speculation (TLS), and lock-free techniques. We find that hardware TM enables large speedups (8.1× on average) but compromises "move fairness" and leads to an unacceptable quality loss. TLS scales poorly, with a maximum 2.2× speedup, but preserves quality. A new dependency-checking parallel strategy achieves the best balance: the deterministic version achieves a 5.9× speedup with no quality loss, while the non-deterministic, lock-free version can scale to a 34× speedup.
"FPGAs in the Cloud: Booting Virtualized Hardware Accelerators with OpenStack"
Stuart Byma, J. Steffan, H. Bannazadeh, Alberto Leon-Garcia, P. Chow (doi: 10.1109/FCCM.2014.42)
We present a new approach for integrating virtualized FPGA-based hardware accelerators into commercial-scale cloud computing systems, with minimal virtualization overhead. Partially reconfigurable regions across multiple FPGAs are offered as generic cloud resources through OpenStack (open-source cloud software), thereby allowing users to "boot" custom-designed or predefined network-connected hardware accelerators with the same commands they would use to boot a regular virtual machine. We propose a hardware and software framework to enable this virtualization. This is a first attempt at closely fitting FPGAs into existing cloud computing models, where resources are virtualized, flexible, and have the illusion of infinite scalability. Our system can set up and tear down virtual accelerators in approximately 2.6 seconds on average, much faster than regular virtual machines. The static virtualization hardware on the physical FPGAs causes only a three-cycle latency increase and a one-cycle pipeline stall per packet in accelerators, compared to a non-virtualized system. We present a case study analyzing the design and performance of an application-level load balancer using a fully implemented prototype of our system. Our study shows that FPGA cloud compute resources can easily outperform virtual machines, while the system's virtualization and abstraction significantly reduce design iteration time and design complexity.
"Look-up Table Design for Deep Sub-threshold through Full-Supply Operation"
M. Abusultan, S. Khatri (doi: 10.1109/FCCM.2014.80)
Field-programmable gate arrays (FPGAs) are the implementation platform of choice when design flexibility is paramount. However, the high power consumption of FPGAs, which arises from their flexible structure, makes them less appealing for extreme low-power applications. In this paper, we present a design for an FPGA lookup table (LUT) aimed at seamless operation over a wide band of supply voltages. The same LUT design can operate at sub-threshold voltage when low power is required and at higher voltages whenever faster performance is required. The results show that operating the LUT in sub-threshold mode yields ~80× lower power and ~4× lower energy than full-supply-voltage operation, for a 6-input LUT implemented in a 22nm predictive technology. The key drawback of sub-threshold operation is its susceptibility to process, temperature, and supply voltage (PVT) variations. This paper also presents the design and experimental results for a closed-loop adaptive body biasing mechanism that dynamically cancels global (spatial) as well as local (random) PVT variations. For the same 22nm technology, we demonstrate that the closed-loop adaptive body biasing circuits allow the FPGA LUT to operate over a frequency range spanning more than an order of magnitude (40 MHz to 1300 MHz). We also show that these circuits can cancel delay variations due to supply voltage changes, and reduce the effect of process variations on setup and hold times by 1.8× and 2.9×, respectively. The dynamic body biasing circuits incur a 3.49% area overhead when each is designed to drive a cluster of 25 LUTs.