This paper deals with the design of an application-specific processor that uses high-level synthesized instruction engines. The approach is demonstrated on a high-speed network flow measurement processor for FPGAs. Our newly proposed concept, called Software Defined Monitoring (SDM), relies on advanced monitoring tasks implemented in software and supported by a configurable hardware accelerator. The monitoring tasks reside in software and can easily control the level of detail retained by the hardware for each flow. This way, the measurement of bulk/uninteresting traffic is offloaded to the hardware, while the interesting traffic is processed in software. SDM enables the creation of flexible monitoring systems capable of deep packet inspection at high throughput. We introduce the processor architecture and a workflow for creating hardware-accelerated measurement modules (instructions) from a description in C/C++. The processor offloads various aggregations and statistics from the main system CPU. The basic type of offload is NetFlow statistics aggregation. We create and evaluate three more aggregation instructions to demonstrate the flexibility of our system. Compared to hand-written instructions, the high-level synthesized instructions are slightly worse in terms of both FPGA resource consumption and frequency. However, the development time is approximately halved.
{"title":"Application specific processor with high level synthesized instructions (abstract only)","authors":"V. Pus, Pavel Benácek","doi":"10.1145/2554688.2554754","DOIUrl":"https://doi.org/10.1145/2554688.2554754","url":null,"abstract":"The paper deals with the design of application-specific processor which uses high level synthesized instruction engines. This approach is demonstrated on the instance of high speed network flow measurement processor for FPGA. Our newly proposed concept called Software Defined Monitoring (SDM) relies on advanced monitoring tasks implemented in the software supported by a configurable hardware accelerator. The monitoring tasks reside in the software and can easily control the level of detail retained by the hardware for each flow. This way, the measurement of bulk/uninteresting traffic is offloaded to the hardware, while the interesting traffic is processed in the software. SDM enables creation of flexible monitoring systems capable of deep packet inspection at high throughput. We introduce the processor architecture and a workflow that allows to create hardware accelerated measurement modules (instructions) from the description in C/C++ language. The processor offloads various aggregations and statistics from the main system CPU. The basic type of offload is the NetFlow statistics aggregation. We create and evaluate three more aggregation instructions to demonstrate the flexibility of our system. Compared to the hand-written instructions, the high level synthesized instructions are slightly worse in terms of both FPGA resources consumption and frequency. 
However, the time needed for development is approximately half.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123463375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
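As an illustration of the kind of C/C++ description such a workflow might start from, the following is a minimal sketch of a NetFlow-style aggregation step. The record layout, the field names, and the function `netflow_update` are assumptions made for this example; the abstract does not specify the actual instruction interface.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical per-flow record kept by the accelerator (layout is assumed).
struct FlowRecord {
    std::uint64_t packets = 0;   // number of packets seen in this flow
    std::uint64_t bytes = 0;     // total bytes seen in this flow
    std::uint64_t first_ts = 0;  // timestamp of the first packet
    std::uint64_t last_ts = 0;   // timestamp of the most recent packet
    std::uint8_t tcp_flags = 0;  // bitwise OR of all observed TCP flags
};

// Hypothetical per-packet metadata handed to the instruction (assumed).
struct PacketMeta {
    std::uint16_t length;
    std::uint64_t timestamp;
    std::uint8_t tcp_flags;
};

// One aggregation "instruction": fold a single packet into its flow record.
// The body is straight-line code with no dynamic memory, which is the style
// HLS tools generally handle well.
inline void netflow_update(FlowRecord& rec, const PacketMeta& pkt) {
    if (rec.packets == 0) {
        rec.first_ts = pkt.timestamp;  // first packet opens the flow
    }
    rec.packets += 1;
    rec.bytes += pkt.length;
    rec.last_ts = std::max(rec.last_ts, pkt.timestamp);
    rec.tcp_flags |= pkt.tcp_flags;
}
```

Other aggregation instructions could plausibly follow the same shape: a fixed-size record plus a pure update function over packet metadata.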
The continuous scaling of the fabrication process, combined with the ever-increasing need for high-performance designs, means that the era of treating all devices the same is about to come to an end. The presented work considers device-oriented optimisations to further boost the performance of a Linear Projection design by focusing on the over-clocking of arithmetic operators. A methodology is proposed for accelerating Linear Projection designs on an FPGA that introduces information about the performance of the hardware under over-clocking conditions at the application level. The novelty of this method is a pre-characterisation of the most error-prone arithmetic operators and the use of this information in the high-level optimisation process of the design. This results in a set of circuit designs that achieve higher throughput with minimum error. FPGA devices are suitable for such optimisations because their reconfigurability allows performance characterisation of the underlying fabric prior to the design of the final system. The reported results show that significant performance gains can be achieved, i.e. up to a 1.85x throughput speed-up compared to existing methodologies, when such device-specific optimisation is considered.
{"title":"Pushing the performance boundary of linear projection designs through device specific optimisations (abstract only)","authors":"R. Duarte, C. Bouganis","doi":"10.1145/2554688.2554717","DOIUrl":"https://doi.org/10.1145/2554688.2554717","url":null,"abstract":"The continuous scaling of the fabrication process combined with the ever increasing need of high performance designs, means that the era of treating all devices the same is about to come to an end. The presented work considers device oriented optimisations in order to further boost the performance of a Linear Projection design by focusing on the over-clocking of arithmetic operators. A methodology is proposed for the acceleration of Linear Projection designs on an FPGA, that introduces information about the performance of the hardware under over-clocking conditions to the application level. The novelty of this method is a pre-characterisation of the most prone to error arithmetic operators and the utilisation of this information in the high-level optimization process of the design. This results in a set of circuit designs that achieve higher throughput with minimum error. FPGA devices are suitable for such optimisations due to their reconfigurability feature that allows performance characterisation of the underlying fabric prior to the design of the final system. The reported results show that significant gains in the performance of the system can be achieved, i.e. 
up to 1.85 times speed up in the throughput compared to existing methodologies, when such device specific optimisation is considered.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"105 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115558770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Architecture","authors":"M. Hutton","doi":"10.1145/3260937","DOIUrl":"https://doi.org/10.1145/3260937","url":null,"abstract":"","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125928707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FPGA behavioral synthesis has gained significant momentum recently with the growing interest in accelerating high-performance computing applications. While the latest generation of high-level synthesis (HLS) tools has made significant progress, these tools still lack support for certain high-level language features such as dynamic memory allocation, even though efficient utilization of the on-chip memory resources in FPGAs is critical to meeting the performance and power consumption targets of many designs. To tackle this problem, we propose a novel hybrid memory allocation scheme to map malloc/free in the C programming language onto FPGA platforms. By estimating the memory usage and the available FPGA memory resources, the scheme judiciously allocates static memory blocks and/or instantiates hardware allocators for memory requests. The partition between these two parts is based on estimated access counts and is obtained by solving an ILP that minimizes the overhead of dynamic memory allocation. Experimental results on benchmark circuits demonstrate the efficacy of the proposed technique.
{"title":"On hybrid memory allocation for FPGA behavioral synthesis (abstract only)","authors":"Qian Zhang, Chenfei Ma, Q. Xu","doi":"10.1145/2554688.2554697","DOIUrl":"https://doi.org/10.1145/2554688.2554697","url":null,"abstract":"FPGA behavioral synthesis has gained significant momentum recently with the growing interests in accelerating high-performance computing applications. While the latest generation of high-level synthesis (HLS) tools has made significant progress, they still lack the support for certain high-level language features such as dynamic memory allocation, despite the fact that efficiently utilization of the on-chip memory resources in FPGAs is critical to achieve the performance and power consumption target for many designs. To tackle the above problem, in this paper, we propose a novel hybrid memory allocation scheme to map malloc/free in C programing language onto FPGA platforms. By estimating the memory usage and available FPGA memory resources, the scheme judiciously allocates static memory blocks and/or instantiate hardware allocators for memory requests. And the partition between these two parts is based on estimated access counts and solving an ILP to minimize overhead from dynamic memory allocation. Experimental results on benchmark circuits demonstrate the efficacy of the proposed technique.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125419844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
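To make the idea of serving malloc/free requests from on-chip memory more concrete, here is a minimal sketch of a fixed-size block pool over a static buffer. It is a generic free-list allocator written for illustration only; the class `BlockPool` and its interface are assumptions, and the abstract does not describe the internals of the proposed hardware allocators.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Generic fixed-size block pool backed by statically sized storage.
// Blocks are identified by index, which avoids raw pointers and keeps the
// structure close to what a block-RAM based allocator could look like.
template <std::size_t BlockWords, std::size_t NumBlocks>
class BlockPool {
public:
    static constexpr int kNull = -1;  // "no block" marker

    BlockPool() {
        // Chain every block into the initial free list: 0 -> 1 -> ... -> kNull.
        for (std::size_t i = 0; i < NumBlocks; ++i) {
            next_[i] = (i + 1 < NumBlocks) ? static_cast<int>(i + 1) : kNull;
        }
        head_ = (NumBlocks > 0) ? 0 : kNull;
    }

    // Pop a block from the free list; returns kNull when the pool is exhausted.
    int alloc() {
        if (head_ == kNull) return kNull;
        const int block = head_;
        head_ = next_[static_cast<std::size_t>(block)];
        return block;
    }

    // Push a previously allocated block back onto the free list.
    void release(int block) {
        next_[static_cast<std::size_t>(block)] = head_;
        head_ = block;
    }

    // Access the payload words of an allocated block.
    std::uint32_t* data(int block) {
        return &storage_[static_cast<std::size_t>(block) * BlockWords];
    }

private:
    std::array<std::uint32_t, BlockWords * NumBlocks> storage_{};
    std::array<int, NumBlocks> next_{};
    int head_ = kNull;
};
```

Both `alloc` and `release` run in constant time, which is one reason free-list pools are a common choice when allocation latency has to be predictable.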
{"title":"Session details: Processors and systems","authors":"M. Leeser","doi":"10.1145/3260940","DOIUrl":"https://doi.org/10.1145/3260940","url":null,"abstract":"","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131789643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Field programmable gate arrays (FPGAs) are the implementation platform of choice when it comes to design flexibility. However, the high power consumption of FPGAs, which arises from their flexible structure, makes them less appealing for extremely low-power applications. In this paper, we present a design of an FPGA look-up table (LUT) with the goal of seamless operation over a wide band of supply voltages. The same LUT design can operate at sub-threshold voltage when low power is required, and at higher voltages whenever faster performance is required. The results show that operating the LUT in sub-threshold mode yields ~80x lower power and ~4x lower energy than full supply voltage operation, for a 6-input LUT implemented in a 22nm predictive technology. The key drawback of sub-threshold operation is its susceptibility to process, supply voltage, and temperature (PVT) variations. This paper also presents the design and experimental results for a closed-loop adaptive body biasing mechanism that dynamically cancels these PVT variations. For the same 22nm technology, we demonstrate that the closed-loop adaptive body biasing circuits allow the FPGA to operate over a frequency range that spans more than an order of magnitude (40 MHz to 1300 MHz). We also show that the closed-loop adaptive body biasing circuits can cancel delay variations due to supply voltage changes, and reduce the effect of process variations on setup and hold times by 1.8x and 2.9x, respectively.
{"title":"FPGA LUT design for wide-band dynamic voltage and frequency scaled operation (abstract only)","authors":"M. Abusultan, S. Khatri","doi":"10.1145/2554688.2554708","DOIUrl":"https://doi.org/10.1145/2554688.2554708","url":null,"abstract":"Field programmable gate arrays (FPGAs) are the implementation platform of choice when it comes to design flexibility. However, the high power consumption of FPGAs (which arises due to their flexible structure), make them less appealing for extreme low power applications. In this paper, we present a design of an FPGA look-up table (LUT), with the goal of seamless operation over a wide band of supply voltages. The same LUT design has the ability to operate at sub-threshold voltage when low power is required, and at higher voltages whenever faster performance is required. The results show that operating the LUT in sub-threshold mode yields a (~80x) lower power and (~4x) lower energy than full supply voltage operation, for a 6-input LUT implemented in a 22nm predictive technology. The key drawback of sub-threshold operation is its susceptibility to process, temperature, and supply voltage (PVT) variations. This paper also presents the design and experimental results for a closed-loop adaptive body biasing mechanism to dynamically cancel these PVT variations. For the same 22nm technology, we demonstrate that the closed-loop adaptive body biasing circuits can allow the FPGA to operate over an operating frequency range that spans an order of magnitude (40 MHz to 1300 MHz). 
We also show that the closed-loop adaptive body biasing circuits can cancel delay variations due to supply voltage changes, and reduce the effect of process variations on setup and hold times by 1.8x and 2.9x respectively.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"138 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116403757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optical flow computation is widely used in video- and image-based applications such as motion detection and video compression. A dense optical flow field, which provides more detailed information, is more useful in many applications. However, high-quality algorithms for dense optical flow computation are computationally expensive. For instance, on the ARM Cortex-A9 processor within the ZYNQ, the popular linear variational method Combine-Brightness-Gradient (CBG) takes 26.68 s per frame to compute optical flow for a 640 x 480 image. It is difficult to speed up, especially on embedded systems with power constraints. Poor portability is another factor that limits the use of current optical flow implementations in more applications. In this paper, a high-performance, low-power FPGA-accelerated implementation of dense optical flow computation is presented. One high-quality dense optical flow method, the Combine-Brightness-Gradient model, is implemented. C code is used instead of VHDL/Verilog HDL to improve productivity. The system is carefully designed for portability so that it can be deployed conveniently on different platforms. Experimental results show that the system achieves 12 fps and 0.38 J per frame for a 640 x 480 image when optical flow is computed for all pixels. Furthermore, portability is demonstrated by implementing the optical flow algorithm on different heterogeneous platforms, namely the ZYNQ-7000 SoC and a PC-FPGA platform with a Kintex-7 FPGA.
{"title":"Implementing FPGA-based energy-efficient dense optical flow computation with high portability in C (abstract only)","authors":"Zhibin Wang, Wenmin Yang, Jin Yu, Zhilei Chai","doi":"10.1145/2554688.2554733","DOIUrl":"https://doi.org/10.1145/2554688.2554733","url":null,"abstract":"Optical flow computation is widely used in many video/image based applications such as motion detection, video compression etc. Dense optical flow field that provides more details of information is more useful in lots of applications. However, high-quality algorithms for dense optical flow computation are computationally expensive. For instance, on the ARM Cortex-A9 processor within ZYNQ, the popular linear variational method Combine-Brightness-Gradient (CBG), spends $26.68s per frame to compute optical flow when the image size is 640 x 480. It is difficult to be sped up especially when embedded systems with power constraints are considered. Poor portability is another factor to limit current implementations of optical flow computation to be used in more applications. In this paper, a high-performance, low-power FPGA-accelerated implementation of dense optical flow computation is presented. One high-quality dense optical flow method, the Combine-Brightness-Gradient model, is implemented. C code instead of VHDL/Verilog HDL is used to improve the productivity. Portability of the system is designed carefully for deploying it on different platforms conveniently. Experimental results show 12 fps and 0.38J per frame are achieved by this optical flow computing system when 640 x 480 image is used and optical flow for all pixels are computed. 
Furthermore, portability is demonstrated by implementing the optical flow algorithm on different heterogeneous platforms such as the ZYNQ-7000 SoC and the PC-FPGA platform with a Kintex-7 FPGA respectively.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130140292","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
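The figures quoted in the abstract above can be combined in a short back-of-the-envelope calculation. This is derived arithmetic only, not a result reported by the authors, and it assumes that the 0.38 J per frame and the 12 fps refer to the same operating point:

```latex
\begin{align*}
  \text{speed-up over the Cortex-A9 baseline}
    &\approx \frac{26.68\ \text{s/frame}}{1/12\ \text{s/frame}}
     = 26.68 \times 12 \approx 320, \\
  \text{average power at 12 fps}
    &\approx 0.38\ \text{J/frame} \times 12\ \text{frames/s}
     = 4.56\ \text{W}.
\end{align*}
```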
Recently, the electronics industry has been facing an increasing number of counterfeit hardware components. These counterfeit components, when assembled into a product or a system, can not only jeopardize performance and reliability but also create safety issues. A Physical Unclonable Function (PUF) provides a means to enhance the physical security of Integrated Circuits (ICs) against piracy and unauthorized access. The proposed design illustrates the feasibility of using self-timed ring oscillators as a novel approach to PUF implementation for FPGA authentication. The proposed Self-Timed Ring Oscillator PUF (STRO-PUF) consists of two groups of identically laid-out self-timed ring oscillators. Inputs to the PUF are given through a challenge generator, which selects two self-timed ring oscillators from each group. The outputs of the oscillators are fed to the multiplexers of the corresponding groups. Self-timed ring oscillators exploit the inherent features of random process variations by producing varying frequencies. These unpredictable variations in frequency are captured by a frequency comparator, which generates an output bit. A unique set of output bits, or response, is generated for each set of input bits, or challenge. This unique Challenge Response Pair (CRP) is used to identify a particular device. The frequencies generated by these oscillators are read through a logic analyzer. The frequencies observed from all the oscillators mapped across different regions of the FPGAs range from 16.234 MHz to 125 MHz, with an average frequency of 101.446 MHz. Experimental results show that the uniqueness of the PUF response is 49.92%, which is very close to the ideal value of 50%.
{"title":"Asynchronous physical unclonable function using FPGA-based self-timed ring oscillator (abstract only)","authors":"R. Silwal, M. Niamat","doi":"10.1145/2554688.2554745","DOIUrl":"https://doi.org/10.1145/2554688.2554745","url":null,"abstract":"Recently, electronic industries have been facing an increased amount of hardware counterfeits. These counterfeit components, when assembled into a product or a system, can not only jeopardize performance and reliability but also create safety issues. Physical Unclonable Function (PUF) provides means to enhance physical security of Integrated Circuits (IC) against piracy and unauthorized access. The proposed design illustrates the feasibility of using self-timed ring oscillators as a novel approach towards PUF implementation for FPGA authentication. The proposed Self-Timed Ring Oscillator PUF (STRO-PUF) consists of two groups of identically laid-out self-timed ring oscillators. Inputs to the PUF are given through a challenge generator, which selects two self-timed ring oscillators from each group. Outputs of oscillators are fed to multiplexers of corresponding groups. Self-timed ring oscillators exploit the inherent features of random process variations by producing varying frequencies. These unpredictable variations in frequencies are captured using frequency comparator, which generates a output bit. A unique set of output bits , or response is generated for each set of input bits, or challenge. This unique Challenge Response Pair (CRP) is used in identifying a particular device. Frequencies generated from these oscillators are read through a logic analyzer. The varying frequencies observed from all the oscillators mapped across different regions of FPGAs range from 16.234 MHz to 125 MHz with the average frequency of 101.446 MHz. 
Experimental result shows the uniqueness for the PUF response is 49.92% which is very close to the desired 50% factor.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130242236","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
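PUF uniqueness is commonly computed as the average pairwise (inter-device) Hamming distance between responses to the same challenge, normalised by the response length and expressed as a percentage. The sketch below assumes that standard definition; the abstract does not state the exact formula used, and the function name `uniqueness_percent` and the 64-bit response limit are choices made for this example.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

// Average pairwise Hamming distance between device responses, as a percentage
// of the response length. Each entry of `responses` holds one device's n-bit
// response to the same challenge in its low n_bits bits (n_bits <= 64).
double uniqueness_percent(const std::vector<std::uint64_t>& responses,
                          unsigned n_bits) {
    const std::size_t k = responses.size();
    if (k < 2 || n_bits == 0 || n_bits > 64) return 0.0;

    // Mask off any bits above the response length before comparing.
    const std::uint64_t mask =
        (n_bits == 64) ? ~std::uint64_t{0} : ((std::uint64_t{1} << n_bits) - 1);

    double sum = 0.0;
    for (std::size_t i = 0; i + 1 < k; ++i) {
        for (std::size_t j = i + 1; j < k; ++j) {
            const std::uint64_t diff = (responses[i] ^ responses[j]) & mask;
            const std::size_t hd = std::bitset<64>(diff).count();
            sum += static_cast<double>(hd) / static_cast<double>(n_bits);
        }
    }
    const double pairs = static_cast<double>(k) * static_cast<double>(k - 1) / 2.0;
    return 100.0 * sum / pairs;
}
```

Under this definition, a value near 50% means that responses from two different devices differ in about half of their bits, which is what independent random responses would produce.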
B. Hutchings, Joshua S. Monson, D. Savory, J. Keeley
A novel Digital to Analog Converter (DAC) modulates the overall power consumption of an FPGA by disabling/enabling short circuits programmed into the interconnect. The power pin of the FPGA serves as the output of the DAC. The DAC achieves high linearity and can be used to implement applications in communications, security, etc. The short-circuit-based DAC consumes one third of the area of an alternative shift-register-based DAC that is presented for comparison.
{"title":"A power side-channel-based digital to analog converter for Xilinx FPGAs","authors":"B. Hutchings, Joshua S. Monson, D. Savory, J. Keeley","doi":"10.1145/2554688.2554770","DOIUrl":"https://doi.org/10.1145/2554688.2554770","url":null,"abstract":"A novel Digital to Analog Converter (DAC) modulates the overall power consumption of an FPGA by disabling/enabling short circuits programmed into the interconnect. The power pin of the FPGA serves as the output of the DAC. The DAC achieves high linearity and can be used to implement applications in communications, security, etc. The short-circuit-based DAC consumes 1/3 the area of an alternative shift-register-based DAC that is presented for the sake of comparison.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125770085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents an adaptive heap sort architecture for an image coding implementation on FPGA, which specifically addresses the issue of sorting the different amounts of data located in each subband during coding. The proposed sorting architecture is easily scalable. The performance of the sorter depends only on the amount of data sorted. The efficient use of dual-port memories yields a high throughput of up to 50 Msamples/s, and their adaptive trigger/shutdown provides an average dynamic power reduction of up to 20.9%. We designed this architecture and incorporated it into our Adaptive Scanning of Wavelet Data (ASWD) module, which reorganizes the wavelet coefficients into locally stationary sequences for a wavelet-based image encoder. We validated the hardware on an Altera Stratix IV FPGA as an IP accelerator in a Nios II processor-based System on Chip. The architectural innovations can also be exploited in other applications that require high-throughput and scalable sorting. Our experiments show that, compared to an embedded ARM Cortex-A9 processor running at 666 MHz, our architecture at 100 MHz provides around a 13x speedup while consuming 242 mW of average core dynamic power.
{"title":"A power-efficient adaptive heapsort for fpga-based image coding application (abstract only)","authors":"Yuhui Bai, S. Z. Ahmed, B. Granado","doi":"10.1145/2554688.2554746","DOIUrl":"https://doi.org/10.1145/2554688.2554746","url":null,"abstract":"This paper presents an adaptive heap sort architecture for an image coding implementation on FPGA, which specifically addresses the issue of sorting different amount of data located in each subband during the coding. The proposed sorting architecture is easily scalable. Performance of the sorter only depends on the amount of data sorted. The efficient usage of dual port memories yields high throughput up to 50 Msamples/s and their adaptive trigger/shutdown provide the average dynamic power reduction up to 20.9%. We designed this architecture and incorporated it in our Adaptive Scanning of Wavelet Data (ASWD) module which reorganizes the wavelet coefficients into locally stationary sequences for a wavelet-based image encoder. We validated the hardware on an Altera's Stratix IV FPGA as an IP accelerator in a Nios II processor based System on Chip. The architectural innovations can also be exploited in other applications that require high throughput and scalable sorting. 
Our experiments show that compared to an embedded ARM CortexA9 processor running at 666 MHz, our architecture at 100 MHz can provide around 13X speedup while consuming 242 mW average core dynamic power.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"509 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132479074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
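For reference, the software baseline that such a sorter accelerates is the textbook in-place heap sort, sketched below. This is the standard algorithm only, not the adaptive hardware architecture from the abstract; the function names `sift_down` and `heap_sort` are chosen for this example.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Restore the max-heap property for the subtree rooted at `root`,
// considering only the elements in a[0 .. end) (end is exclusive).
inline void sift_down(std::vector<int>& a, std::size_t root, std::size_t end) {
    while (true) {
        std::size_t child = 2 * root + 1;  // left child
        if (child >= end) return;          // root is a leaf
        if (child + 1 < end && a[child] < a[child + 1]) {
            ++child;                       // pick the larger child
        }
        if (a[root] >= a[child]) return;   // heap property already holds
        std::swap(a[root], a[child]);
        root = child;
    }
}

// Sort `a` in ascending order in place.
inline void heap_sort(std::vector<int>& a) {
    const std::size_t n = a.size();
    // Phase 1: build a max-heap bottom-up.
    for (std::size_t i = n / 2; i-- > 0;) {
        sift_down(a, i, n);
    }
    // Phase 2: repeatedly move the maximum to the end and shrink the heap.
    for (std::size_t end = n; end > 1; --end) {
        std::swap(a[0], a[end - 1]);
        sift_down(a, 0, end - 1);
    }
}
```

The running time depends on the number of elements rather than on their initial order, which is consistent with the abstract's remark that sorter performance depends only on the amount of data sorted.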