This poster presents our preliminary findings on the relationship between speedup and energy efficiency in FPGA-based Chip Heterogeneous Multiprocessor Systems (CHMPs). While researchers have investigated how to tailor combinations of heterogeneous compute engines within a CHMP system to best meet the performance needs of specific applications, how these optimized architectures also affect energy efficiency is not as well studied. We show that a simple relationship exists between the speedup these systems gain and their associated energy efficiency, linking Amdahl's law to energy efficiency. All experimental results were obtained through actual run-time measurements on homogeneous and heterogeneous multiprocessor systems implemented within a Xilinx Virtex-6 FPGA. We further show how, in a system with six MicroBlaze soft processors, the dynamic power, and hence the overall energy efficiency of the system, can be controlled through transparent operating system management of the compute resources. We also present how clock gating can control the dynamic power consumption of each processor; with this careful power-aware management unit, the system's dynamic power consumption can track the requirements of each application.
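The speedup/efficiency relationship the poster alludes to can be illustrated with a minimal sketch. The power model below is our own illustrative assumption (a fixed static term plus a per-active-core dynamic term), not the paper's measured Virtex-6 data:

```python
# Illustrative sketch: Amdahl's-law speedup and a simple
# performance-per-watt estimate for an n-processor system.
# The power model (p_static, p_dynamic) is a hypothetical assumption.

def amdahl_speedup(parallel_fraction, n):
    """Speedup when a fraction of the workload can use n processors."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

def energy_efficiency(parallel_fraction, n, p_static=1.0, p_dynamic=1.0):
    """Performance per watt: each active core adds p_dynamic,
    and the whole chip pays p_static regardless."""
    speedup = amdahl_speedup(parallel_fraction, n)
    power = p_static + n * p_dynamic
    return speedup / power
```

With a 90% parallel workload on six cores this gives a speedup of 4x, but efficiency of only about 0.57 of a single unit-power core, which is why gating idle cores matters.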
Sen Ma, D. Andrews, "On energy efficiency and Amdahl's law in FPGA based chip heterogeneous multiprocessor systems (abstract only)," Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2014. doi:10.1145/2554688.2554719
Processors with an embedded runtime-reconfigurable fabric have been explored in academia, and industry has started production of commercial platforms (e.g., Xilinx Zynq-7000). While such fabrics provide significant performance and efficiency gains, the comparatively long reconfiguration time limits these advantages when applications request reconfigurations frequently. In multi-tasking systems, frequent task switches lead to frequent reconfigurations and are thus a major hurdle for further performance increases. Sophisticated task scheduling is a very effective means of reducing the negative impact of these reconfiguration requests. In this paper, we propose an online approach that combines task scheduling with re-distribution of the reconfigurable fabric between tasks in order to reduce the makespan, i.e., the completion time of a taskset executing on a runtime-reconfigurable processor. Evaluating multiple tasksets composed of multimedia applications, our proposed approach achieves makespans that are on average only 2.8% worse than those of a theoretically optimal schedule that assumes zero-overhead reconfiguration. In comparison, scheduling approaches deployed in state-of-the-art reconfigurable processors achieve makespans 14%-20% worse than optimal. As our approach is a purely software-side mechanism, a multitude of reconfigurable platforms aimed at multi-tasking can benefit from it.
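Why reconfiguration overhead inflates makespan can be seen with a toy sequential scheduler. This is a hypothetical model of our own, not MORP's actual algorithm: each task carries a configuration id, and switching configurations pays a fixed delay:

```python
# Toy model (not MORP): sequential execution where changing the
# accelerator configuration between tasks costs reconfig_time.

def makespan(tasks, reconfig_time):
    """tasks: list of (config_id, run_time) tuples, executed in order."""
    total = 0.0
    current = None
    for config, run_time in tasks:
        if config != current:
            total += reconfig_time  # pay the reconfiguration penalty
            current = config
        total += run_time
    return total
```

Even in this crude model, ordering tasks so that equal configurations are adjacent removes half the reconfigurations and shortens the makespan, which is the kind of gain a reconfiguration-aware scheduler exploits.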
Artjom Grudnitsky, L. Bauer, J. Henkel, "MORP: makespan optimization for processors with an embedded reconfigurable fabric," Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2014. doi:10.1145/2554688.2554782
Yuliang Sun, Zilong Wang, Sitao Huang, Lanjun Wang, Yu Wang, Rong Luo, Huazhong Yang
Frequent item counting is one of the most important operations in time-series data mining algorithms, and the space-saving algorithm is a widely used approach to this problem. As data input speeds rise rapidly, the most challenging problem in frequent item counting is meeting the requirement of wire-speed processing. In this paper, we propose a streaming-oriented PE-ring framework on FPGA for counting frequent items. Compared with the best existing FPGA implementation, our basic PE-ring framework reduces lookup-table resource cost by 50% and achieves the same throughput in a more scalable way. Furthermore, we adopt a SIMD-like cascaded filter for further performance improvement, which outperforms the previous work by up to 3.24 times for some data distributions.
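The space-saving algorithm referenced above is simple to state in software. A minimal sketch of the standard counter-based scheme (the software baseline, not the PE-ring hardware itself):

```python
# Minimal Space-Saving sketch: track roughly the k most frequent
# items of a stream with at most k counters.

def space_saving(stream, k):
    counters = {}  # item -> estimated count
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Evict the item with the minimum count; the newcomer
            # inherits that count plus one (an overestimate bound).
            victim = min(counters, key=counters.get)
            count = counters.pop(victim)
            counters[item] = count + 1
    return counters
```

The eviction step is the part that is awkward at wire speed, since it requires finding the minimum counter per incoming item, which is what dedicated hardware structures are designed to hide.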
Yuliang Sun, Zilong Wang, Sitao Huang, Lanjun Wang, Yu Wang, Rong Luo, Huazhong Yang, "Accelerating frequent item counting with FPGA," Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2014. doi:10.1145/2554688.2554766
Polynomial evaluation is important across a wide range of application domains, so significant work has been done on accelerating its computation. The conventional algorithm, known as Horner's rule, involves the fewest steps but can suffer from long latency due to its serial computation. Parallel evaluation algorithms such as Estrin's method have shorter latency than Horner's rule, but at the expense of large hardware overhead. This paper presents an efficient polynomial evaluation algorithm that restructures the evaluation process to include an increased number of squaring steps. Using a squarer design that is more efficient than a general multiplier, this yields polynomial evaluation with a 57.9% latency reduction over Horner's rule and 14.6% over Estrin's method, while consuming less area than Horner's rule, when implemented on a Xilinx Virtex-6 FPGA. When applied to fixed-point function evaluation, where precision requirements limit the rounding of operands, it still achieves a 52.4% performance gain compared to Horner's rule with only a 4% area overhead when evaluating 5th-degree polynomials.
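The two baselines the paper compares against can be contrasted in a few lines. This is a generic software sketch of Horner's rule and Estrin's method, not the paper's square-rich scheme (coefficients are given lowest degree first):

```python
# Horner vs. Estrin for p(x) = c0 + c1*x + c2*x^2 + ...

def horner(coeffs, x):
    """Serial evaluation: every multiply-add depends on the previous one."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

def estrin(coeffs, x):
    """Pairwise evaluation, e.g. (c0 + c1*x) + x^2*(c2 + c3*x) for
    degree 3: the pairs are independent, exposing parallelism at the
    cost of extra multipliers (and repeated squarings of x)."""
    if len(coeffs) == 1:
        return coeffs[0]
    pairs = [coeffs[i] + coeffs[i + 1] * x if i + 1 < len(coeffs)
             else coeffs[i]
             for i in range(0, len(coeffs), 2)]
    return estrin(pairs, x * x)
```

Note that Estrin's recursion repeatedly squares x, which hints at why a cheap dedicated squarer, as exploited in the paper, shifts the cost balance.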
Simin Xu, Suhaib A. Fahmy, I. Mcloughlin, "Square-rich fixed point polynomial evaluation on FPGAs," Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2014. doi:10.1145/2554688.2554779
MapReduce is a widely used programming framework for implementing cloud computing applications in data centers. This work presents a novel configurable hardware accelerator that speeds up multi-core and cloud computing applications based on the MapReduce programming framework. The proposed accelerator augments multi-core processors and performs fast indexing and accumulation of key/value pairs using an efficient memory architecture based on Cuckoo hashing. The accelerator consists of memory buffers that store the key/value pairs and processing units that accumulate the values of keys sent from the processors. In essence, the accelerator relieves the processors of the Reduce tasks: they execute only the Map tasks and emit the intermediate key/value pairs to the hardware acceleration unit, which performs the Reduce operation. The number and size of the keys that can be stored on the accelerator are configurable based on application requirements. The accelerator has been implemented on a multi-core FPGA with embedded ARM processors (Xilinx Zynq) and integrated with the MapReduce programming framework under Linux. The performance evaluation shows that the proposed accelerator can achieve up to 1.8x system speedup for MapReduce applications and hence significantly reduce the execution time of multi-core and cloud computing applications. (Action: "Supporting Postdoctoral Researchers", "Education and Lifelong Learning" Program (GSRT), co-financed by the ESF and the Greek State.)
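Cuckoo hashing, the indexing scheme the accelerator's memory architecture builds on, guarantees that a key resides in one of two candidate slots, giving constant-time lookups. A minimal software sketch with illustrative integer hash functions (our own placeholders, not the hardware's):

```python
# Minimal cuckoo hash sketch: two tables, two hash functions; an insert
# that finds both slots occupied kicks out an occupant and relocates it.

class CuckooTable:
    def __init__(self, size=8, max_kicks=16):
        self.size = size
        self.max_kicks = max_kicks
        self.tables = [[None] * size, [None] * size]

    def _slot(self, which, key):
        # Two simple multiplicative hashes (illustrative; integer keys).
        consts = (2654435761, 2246822519)
        return ((key * consts[which]) >> 8) % self.size

    def insert(self, key, value):
        entry = (key, value)
        for _ in range(self.max_kicks):
            for which in (0, 1):
                i = self._slot(which, entry[0])
                if self.tables[which][i] is None:
                    self.tables[which][i] = entry
                    return True
                # Evict the occupant and try to re-place it instead.
                self.tables[which][i], entry = entry, self.tables[which][i]
        return False  # give up: table would need rehashing/growth

    def get(self, key):
        """Lookup touches at most two slots, one per table."""
        for which in (0, 1):
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None
```

The bounded two-probe lookup is what makes the scheme attractive in hardware: both slots can be read in parallel from separate memory banks.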
C. Kachris, G. Sirakoulis, D. Soudris, "A configurable mapreduce accelerator for multi-core FPGAs (abstract only)," Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2014. doi:10.1145/2554688.2554700
Jian Gong, Jiahua Chen, Haoyang Wu, Fan Ye, Songwu Lu, J. Cong, Tao Wang
The rapid growth in the resources and processing power of FPGAs has made them more and more attractive as accelerator platforms. Due to its high performance, the PCIe bus is the preferred interconnect between the host computer and loosely coupled FPGA accelerators. To fully utilize the performance of PCIe, developers normally have to write a significant amount of PCIe-related code. In this paper, we present the design of EPEE, an efficient PCIe communication library that integrates with hosts easily and relieves developers of this burden. Making a PCIe communication library both highly efficient and easy to integrate is not trivial. We have identified several challenges in this work: 1) the conflict between efficiency and functionality; 2) support for multi-clock-domain interfaces; 3) handling out-of-order DMA data transfers; 4) portability. Few existing systems address all of these challenges. EPEE has a highly efficient core library that is extensible. We provide a set of high-level APIs to shorten developers' learning curve, and divide the hardware library into device-dependent and device-independent layers for portability. We have implemented EPEE on several generations of Xilinx FPGAs, reaching 12.7 Gbps half-duplex and 20.8 Gbps full-duplex data rates in PCIe Gen2 x4 mode (79.4% and 64.0% of the theoretical maximum data rates, respectively). EPEE has already been used in four different FPGA applications, and it can be integrated with high-level synthesis tools, in particular Vivado HLS.
Jian Gong, Jiahua Chen, Haoyang Wu, Fan Ye, Songwu Lu, J. Cong, Tao Wang, "EPEE: an efficient PCIe communication library with easy-host-integration property for FPGA accelerators (abstract only)," Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2014. doi:10.1145/2554688.2554723
Although the reliability and robustness of the AES protocol have been thoroughly demonstrated over the years, recent research results and technology advancements are raising serious concerns about its solidity in the (quite near) future. In fact, smarter brute-force attacks and new computing systems are expected to drastically decrease the security of the AES protocol in the coming years (e.g., quantum computing will enable search algorithms able to perform a brute-force attack on a 2n-bit key in the same time required by a conventional algorithm for an n-bit key). In this context, we propose an extension of the AES algorithm that supports longer encryption keys, thus increasing the security of the algorithm itself. In addition, we propose a set of parametric implementations of this novel extended protocol. These architectures can be optimized either to minimize area usage or to maximize performance. Experimental results show that, while the proposed implementations achieve higher throughput than most state-of-the-art approaches and the highest Performance/Area value when working with 128-bit encryption keys, they achieve an 84x throughput speedup compared to approaches in the literature when working with 512-bit encryption keys.
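The quantum-computing claim in the abstract follows from Grover's algorithm, which searches a space of N keys in on the order of sqrt(N) oracle queries. A quick sanity check of that key-strength arithmetic:

```python
import math

# Grover's algorithm needs ~sqrt(N) queries to search N keys, so a
# 2n-bit key under quantum search costs about as many steps as an
# n-bit key under classical exhaustive search.

def classical_steps(key_bits):
    return 2 ** key_bits  # worst-case exhaustive search

def grover_steps(key_bits):
    return math.isqrt(2 ** key_bits)  # ~sqrt(2^bits) oracle queries
```

This halving of effective key strength is why moving from 128-bit to 256-bit (or longer) keys restores the classical security margin.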
A. A. Nacci, V. Rana, M. Santambrogio, D. Sciuto, "Improving the security and the scalability of the AES algorithm (abstract only)," Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2014. doi:10.1145/2554688.2554735
This paper deals with the design of an application-specific processor that uses high-level-synthesized instruction engines. The approach is demonstrated on a high-speed network flow measurement processor for FPGA. Our newly proposed concept, called Software Defined Monitoring (SDM), relies on advanced monitoring tasks implemented in software and supported by a configurable hardware accelerator. The monitoring tasks reside in software and can easily control the level of detail retained by the hardware for each flow. This way, the measurement of bulk/uninteresting traffic is offloaded to the hardware, while the interesting traffic is processed in software. SDM enables the creation of flexible monitoring systems capable of deep packet inspection at high throughput. We introduce the processor architecture and a workflow that allows hardware-accelerated measurement modules (instructions) to be created from descriptions in C/C++. The processor offloads various aggregations and statistics from the main system CPU. The basic type of offload is NetFlow statistics aggregation. We create and evaluate three more aggregation instructions to demonstrate the flexibility of our system. Compared to hand-written instructions, the high-level-synthesized instructions are slightly worse in terms of both FPGA resource consumption and frequency. However, the development time is approximately halved.
V. Pus, Pavel Benácek, "Application specific processor with high level synthesized instructions (abstract only)," Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2014. doi:10.1145/2554688.2554754
The continuous scaling of the fabrication process, combined with the ever-increasing need for high-performance designs, means that the era of treating all devices the same is coming to an end. The presented work considers device-oriented optimisations to further boost the performance of a Linear Projection design, focusing on the over-clocking of arithmetic operators. A methodology is proposed for accelerating Linear Projection designs on an FPGA that exposes information about the performance of the hardware under over-clocking conditions to the application level. The novelty of this method is the pre-characterisation of the arithmetic operators most prone to error and the use of this information in the high-level optimisation of the design. This results in a set of circuit designs that achieve higher throughput with minimal error. FPGA devices are suitable for such optimisations because their reconfigurability allows the underlying fabric to be characterised before the final system is designed. The reported results show that significant performance gains, up to 1.85 times higher throughput compared to existing methodologies, can be achieved when such device-specific optimisation is considered.
R. Duarte, C. Bouganis, "Pushing the performance boundary of linear projection designs through device specific optimisations (abstract only)," Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Feb. 2014. doi:10.1145/2554688.2554717
{"title":"FPGA LUT design for wide-band dynamic voltage and frequency scaled operation (abstract only)","authors":"M. Abusultan, S. Khatri","doi":"10.1145/2554688.2554708","DOIUrl":"https://doi.org/10.1145/2554688.2554708","url":null,"abstract":"Field programmable gate arrays (FPGAs) are the implementation platform of choice when it comes to design flexibility. However, the high power consumption of FPGAs, which arises from their flexible structure, makes them less appealing for extreme low-power applications. In this paper, we present the design of an FPGA look-up table (LUT) with the goal of seamless operation over a wide band of supply voltages. The same LUT design can operate at a sub-threshold voltage when low power is required and at higher voltages whenever faster performance is required. The results show that operating the LUT in sub-threshold mode yields ~80x lower power and ~4x lower energy than full supply voltage operation, for a 6-input LUT implemented in a 22nm predictive technology. The key drawback of sub-threshold operation is its susceptibility to process, temperature, and supply voltage (PVT) variations. This paper also presents the design and experimental results of a closed-loop adaptive body biasing mechanism that dynamically cancels these PVT variations. For the same 22nm technology, we demonstrate that the closed-loop adaptive body biasing circuits allow the FPGA to operate over a frequency range that spans an order of magnitude (40 MHz to 1300 MHz). We also show that these circuits can cancel delay variations due to supply voltage changes and reduce the effect of process variations on setup and hold times by 1.8x and 2.9x, respectively.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"138 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116403757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
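The closed-loop adaptive body biasing described above is an analog, on-chip feedback mechanism; as a rough illustration of the feedback idea only, the sketch below simulates a controller that integrates replica-path delay error into a body-bias setting until PVT-induced drift is nulled. The delay model, gains, and step sizes are invented for illustration and are not taken from the paper:

```python
# Toy feedback-loop sketch (assumption: a linearised replica-path delay
# model where PVT drift slows the path and forward body bias speeds it up).

def measured_delay(bias, nominal=1.0, drift=0.2, gain=0.5):
    """Replica-path delay (arbitrary units): nominal delay plus PVT
    drift, reduced in proportion to the applied body bias."""
    return nominal + drift - gain * bias

def regulate(target=1.0, step=0.05, tol=1e-3, max_iters=1000):
    """Simple integral controller: accumulate delay error into the
    body-bias setting until the replica path matches the target."""
    bias = 0.0
    for _ in range(max_iters):
        err = measured_delay(bias) - target
        if abs(err) < tol:
            break
        bias += step * err  # push bias in the direction that cancels err
    return bias

bias = regulate()
print(round(bias, 2))  # settles near drift/gain = 0.4 in this toy model
```

In this linear toy model the loop converges to the bias that exactly cancels the drift term; the real circuit closes the same kind of loop continuously in hardware, which is what lets it track temperature and supply-voltage changes at run time.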