Towards secure cryptographic software implementation against side-channel power analysis attacks
Pei Luo, Liwei Zhang, Yunsi Fei, A. Ding
2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2015-07-27, pp. 144-148. DOI: 10.1109/ASAP.2015.7245722
Side-channel attacks are a real threat to many embedded cryptographic systems. A commonly used algorithmic countermeasure, random masking, incurs large execution delays and resource overhead. Another countermeasure, operation shuffling (permutation), can mitigate side-channel leakage effectively with minimal overhead. In this paper, we target the automatic implementation of operation shuffling in cryptographic algorithms to resist side-channel power analysis attacks. We design a tool to detect independence among statements at the source-code level and devise an algorithm for automatic operation shuffling. We test our algorithm on Keccak, the new SHA-3 standard. Results show that the tool effectively implements operation shuffling to reduce side-channel leakage significantly, and can therefore guide automatic secure cryptographic software implementation against differential power analysis attacks.
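The shuffling idea this abstract describes can be illustrated with a small sketch (hypothetical code, not the authors' tool): once a dependence analysis has certified a group of statements as mutually independent, executing them in a fresh random order on every run decorrelates the power trace from any fixed instruction schedule.

```python
import random

def shuffled_execute(statements, rng=random.Random(0)):
    """Execute a set of mutually independent statements in random order.

    `statements` is a list of zero-argument callables that a (hypothetical)
    dependence analysis has verified do not depend on one another, so any
    execution order computes the same result.
    """
    order = list(range(len(statements)))
    rng.shuffle(order)              # fresh permutation on every invocation
    for i in order:
        statements[i]()
    return order

# Toy example: four independent S-box-like table lookups into `out`.
sbox = [(i * 7 + 3) % 16 for i in range(16)]
state = [1, 5, 9, 13]
out = [None] * 4
stmts = [lambda k=k: out.__setitem__(k, sbox[state[k]]) for k in range(4)]
shuffled_execute(stmts)
```

Regardless of the permutation chosen, `out` ends up identical — which is exactly the property the independence analysis must guarantee before shuffling is legal.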
Custom FPGA-based soft-processors for sparse graph acceleration
Nachiket Kapre
2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2015-07-27, pp. 9-16. DOI: 10.1109/ASAP.2015.7245698
FPGA-based soft processors customized for operations on sparse graphs can deliver significant performance improvements over conventional organizations (ARMv7 CPUs) for bulk synchronous sparse graph algorithms. We develop a stripped-down soft processor ISA to implement specific repetitive operations on graph nodes and edges that are commonly observed in sparse graph computations. In the processing core, we provide hardware support for rapidly fetching and processing the state of local graph nodes and edges through spatial address generators and zero-overhead loop iterators. We interconnect a 2D array of these lightweight processors with a packet-switched network-on-chip to enable fine-grained operand routing along the graph edges, and provide custom send/receive instructions in the soft processor. We develop the processor RTL using Vivado High-Level Synthesis and also provide an assembler and compilation flow to configure the processor instruction and data memories. We outperform a Microblaze (100 MHz on Zedboard) and a NIOS-II/f (100 MHz on DE2-115) by 6× (single-processor design), as well as the ARMv7 dual-core CPU on the Zynq SoCs by as much as 10× on the Xilinx ZC706 board (100-processor design), across a range of matrix datasets.
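The bulk-synchronous pattern these soft processors target can be modelled in a few lines (an illustrative sketch, not the paper's RTL): a communication phase routes operands along graph edges, then a computation phase folds each node's inbox — here instantiated as a sparse matrix-vector product.

```python
def bsp_superstep(nodes, edges, send, reduce_):
    """One bulk-synchronous superstep over a sparse graph.

    nodes: dict node_id -> local state; edges: list of (src, dst, weight).
    `send` produces the message carried along an edge; `reduce_` folds
    each node's received messages into its new state.
    """
    inbox = {v: [] for v in nodes}
    # Communication phase: route operands along graph edges.
    for src, dst, w in edges:
        inbox[dst].append(send(nodes[src], w))
    # Computation phase: each node folds its inbox.
    return {v: reduce_(msgs) for v, msgs in inbox.items()}

# Toy SpMV y = A @ x expressed as a graph: nonzero A[i][j] -> edge (j, i).
A = {(0, 0): 2.0, (0, 1): 1.0, (1, 1): 3.0}
x = {0: 1.0, 1: 4.0}
edges = [(j, i, a) for (i, j), a in A.items()]
y = bsp_superstep(x, edges, send=lambda xj, a: a * xj, reduce_=sum)
```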
Hardware acceleration of Private Information Retrieval protocols using GPUs
Mihai Maruseac, Gabriel Ghinita, Ming Ouyang, R. Rughinis
2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2015-07-27, pp. 120-127. DOI: 10.1109/ASAP.2015.7245719
Private Information Retrieval (PIR) protocols allow users to search for data items stored at an untrusted server without disclosing the search attributes to the server. Several computational PIR protocols provide cryptographic-strength privacy guarantees for users, building upon well-known hard mathematical problems such as the factorisation of large integers. Unfortunately, the computationally intensive nature of these solutions results in significant performance overhead, preventing their adoption in practice. In this paper, we employ graphics processing units (GPUs) to speed up the cryptographic operations required by PIR. We identify the challenges that arise when using GPUs for PIR and propose solutions to address them. To the best of our knowledge, this is the first work to use GPUs for efficient private information retrieval, and an important first step towards GPU-based acceleration of a broader range of secure data operations. Our experimental evaluation shows that GPUs improve performance by more than an order of magnitude.
Atomic stream computation unit based on micro-thread level parallelism
Nasim Farahini, A. Hemani
2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2015-07-27, pp. 25-29. DOI: 10.1109/ASAP.2015.7245700
The increasing demand for higher image resolution and communication bandwidth requires streaming applications to deal with ever-larger datasets. Further, with technology scaling, the cost of moving data is falling more slowly than the cost of computing. These trends motivate the proposed micro-architectural reorganization of stream processors: the stream computation is divided into functional computation, address-constraint computation, and address generation, and independent, distributed micro-threads are deployed to implement them. This scheme is an alternative to parallelizing at the instruction level, and has two benefits: more efficient sequencer logic, and energy savings in address generation and transportation. These benefits are quantified for a set of streaming applications, showing average improvements of 39% in the silicon efficiency of the sequencer logic and 23% in total computational efficiency.
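The split this abstract proposes — separate micro-threads for address generation and functional computation — can be mimicked with Python generators (a behavioural sketch under assumed names, not the paper's hardware):

```python
def address_thread(base, stride, count):
    """Address-generation micro-thread: emits a 1-D affine address stream,
    the kind of pattern a spatial address generator produces in hardware."""
    addr = base
    for _ in range(count):
        yield addr
        addr += stride

def functional_thread(mem, addresses, op):
    """Functional micro-thread: consumes the address stream and applies the
    per-element computation, knowing nothing about how addresses are made."""
    return [op(mem[a]) for a in addresses]

mem = list(range(100, 116))            # toy data memory
result = functional_thread(mem,
                           address_thread(base=2, stride=3, count=4),
                           op=lambda v: v * 2)
```

Because the two threads communicate only through the address stream, either side can be replaced independently — the decoupling that lets the hardware version simplify the sequencer and save address-generation energy.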
Application-set driven exploration for custom processor architectures
M. A. Arslan, F. Gruian, K. Kuchcinski
2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2015-07-27, pp. 70-71. DOI: 10.1109/ASAP.2015.7245710
Custom architectures are often adopted as more efficient alternatives to general-purpose processors in terms of performance and power. However, designing such architectures requires expertise in both hardware and the application domain. In this paper we propose a method for speeding up design space exploration. Our method, based on Pareto points, identifies sets of solutions, in terms of scalar units and vector units of a certain length, that fulfill the throughput constraints of each application in a given set. Architectures can then be selected by combining these solutions, as starting points for a more thorough, model-based evaluation.
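The Pareto-point selection step can be sketched as follows (hypothetical code; the paper's actual constraint model over scalar and vector units is richer): among candidate configurations scored by cost and achieved throughput, keep only the non-dominated ones, then filter by an application's throughput constraint.

```python
def pareto_front(candidates):
    """Return the non-dominated (cost, throughput) points: a point is
    dominated if some other point is no more expensive and no slower,
    and differs from it (i.e. is at least as good everywhere)."""
    def dominates(d, c):
        return d[0] <= c[0] and d[1] >= c[1] and d != c
    return [c for c in candidates
            if not any(dominates(d, c) for d in candidates)]

# Hypothetical (cost, throughput) scores for scalar/vector-unit mixes.
cands = [(4, 10), (5, 10), (6, 20), (3, 5)]
front = pareto_front(cands)                  # (5, 10) is dominated by (4, 10)
feasible = [c for c in front if c[1] >= 10]  # throughput constraint >= 10
```

Intersecting the feasible fronts of all applications in the set then yields the candidate architectures worth carrying into the slower, model-based evaluation.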
Balance power leakage to fight against side-channel analysis at gate level in FPGAs
Xin Fang, Pei Luo, Yunsi Fei, M. Leeser
2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2015-07-27, pp. 154-155. DOI: 10.1109/ASAP.2015.7245724
Side-channel attacks are a serious threat to the security of embedded cryptographic systems, and various countermeasures have been devised to mitigate the leakage. Power-balancing techniques such as wave dynamic differential logic (WDDL) aim to balance power consumption by introducing differential logic. However, differences in routing length lead to different wire capacitances, which weakens the power-balancing countermeasure. In this paper, we further balance the power of differential signals by manipulating lower-level primitives and placement constraints on a Field Programmable Gate Array (FPGA). We choose the Advanced Encryption Standard (AES) as the encryption algorithm and apply the Hamming weight model to quantify the leakage of different implementations. Results show that our method not only mitigates side-channel leakage efficiently but also saves FPGA logic-block resources and dynamic power consumption.
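The Hamming weight model mentioned in this abstract assumes the instantaneous power draw is proportional to the Hamming weight of the value being processed. A minimal correlation-attack sketch shows how that model quantifies leakage (hypothetical code; a real first-round AES attack targets sbox[pt ^ key] — the S-box table is omitted here for brevity and the raw XOR is used instead):

```python
import random

def hamming_weight(x: int) -> int:
    return bin(x).count("1")

def hw_hypotheses(plaintexts, key_guess):
    """Hypothesized leakage under the Hamming weight model for the
    intermediate pt ^ key_guess (stand-in for sbox[pt ^ key_guess])."""
    return [hamming_weight(p ^ key_guess) for p in plaintexts]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Simulate noisy traces leaking HW of the true intermediate, then rank guesses.
rng = random.Random(1)
key = 0x3C
pts = [rng.randrange(256) for _ in range(500)]
traces = [hamming_weight(p ^ key) + rng.gauss(0, 0.5) for p in pts]
best = max(range(256), key=lambda g: pearson(hw_hypotheses(pts, g), traces))
```

A well-balanced implementation drives the correlation peak for the correct key down toward the noise floor, which is exactly what the paper's leakage comparison measures.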
Multi-task support for security-enabled embedded processors
Tedy Thomas, Arman Pouraghily, Kekai Hu, R. Tessier, T. Wolf
2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2015-07-27, pp. 136-143. DOI: 10.1109/ASAP.2015.7245721
Embedded systems require low-overhead security approaches to ensure that they are protected from attacks. In this paper, we propose a hardware-based approach to secure the operation of an embedded processor instruction by instruction, where deviations from expected program behavior are detected within the execution of an instruction. These security-enabled embedded processors provide effective defenses against common attacks, such as stack smashing. Previous work in this area has focused on monitoring a single task on a CPU; here we present a novel hardware monitoring system that can monitor multiple active tasks on an operating-system-based platform. The hardware monitor tracks context switches that occur in the operating system and ensures that monitoring is performed continuously, thus ensuring system security. We present the design of our system and results obtained from a prototype implementation on an Altera DE4 FPGA board. We demonstrate in hardware that applications can be monitored at the instruction level without execution slowdown, and that stack smashing attacks can be defeated using our system.
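The monitoring scheme can be caricatured in software (an illustrative model, not the paper's FPGA design): the monitor holds one table of valid instruction checkpoints per task, follows the OS's context switches, and flags any fetched instruction that deviates from that task's expected behavior — e.g. control flow diverted by a stack-smashing attack.

```python
class MultiTaskMonitor:
    """Toy model of a per-instruction hardware monitor for multiple tasks.

    valid_tables maps task_id -> set of permitted (pc, instruction_hash)
    pairs, derived offline from each program's binary.
    """
    def __init__(self, valid_tables):
        self.valid = valid_tables
        self.task = None

    def context_switch(self, task_id):
        # The OS notifies the monitor, so checking continues across switches.
        self.task = task_id

    def check(self, pc, instr_hash):
        return (pc, instr_hash) in self.valid[self.task]

mon = MultiTaskMonitor({1: {(0x100, 0xAB), (0x104, 0xCD)},
                        2: {(0x200, 0xEF)}})
mon.context_switch(1)
ok = mon.check(0x100, 0xAB)        # legitimate instruction of task 1
mon.context_switch(2)
attack = mon.check(0x4141, 0x00)   # diverted control flow is rejected
```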
Mixed-length SIMD code generation for VLIW architectures with multiple native vector-widths
Erkan Diken, M. O'Riordan, Roel Jordans, L. Józwiak, H. Corporaal, D. Moloney
2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2015-07-27, pp. 181-188. DOI: 10.1109/ASAP.2015.7245732
The degree of data-level parallelism (DLP) in applications is not fixed; it varies with the computational characteristics of each application. In contrast, most processors today include single-width SIMD (vector) hardware to exploit DLP. Single-width SIMD architectures may therefore be suboptimal for applications with varying DLP, causing performance and energy inefficiency. We propose using VLIW processors with multiple native vector widths to better serve applications with changing DLP. SHAVE is an example of such a VLIW processor, providing hardware support for native 32-bit and 128-bit wide vector operations. This paper researches and implements mixed-length SIMD code generation for the SHAVE processor. More specifically, we target generating 32-bit and 128/64-bit SIMD code for SHAVE's native 32-bit and 128-bit wide vector units. In this way, we improve the performance of compiler-generated SIMD code by reducing the number of overhead operations and by increasing SIMD hardware utilization. Experimental results demonstrate that our methodology, implemented in the compiler, improves the performance of synthetic benchmarks by up to 47%.
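The code-generation decision this paper automates can be modelled abstractly (a hypothetical sketch, not the actual compiler): cover a loop's 32-bit elements with the widest native vector operations first, then fall back to narrower ones for the remainder, reducing per-element instruction overhead.

```python
def vectorize_plan(n_elems, widths=(4, 1)):
    """Model of mixed-width SIMD code generation: cover n 32-bit elements
    with the widest native vector op first (4 lanes ~ 128-bit), falling
    back to narrower ops (1 lane ~ 32-bit) for the remainder.

    Returns the emitted operations as (lanes, start_index) pairs.
    """
    plan, i = [], 0
    for lanes in sorted(widths, reverse=True):
        while n_elems - i >= lanes:
            plan.append((lanes, i))
            i += lanes
    return plan

# 11 elements -> two 4-lane ops cover 8 elements, three 1-lane ops finish.
plan = vectorize_plan(11)
```

With a single-width (4-lane only) ISA the 3-element tail would need masking or a scalar loop; the mixed-width plan covers it with native narrow ops, which is the source of the overhead reduction the abstract reports.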
Accelerating bootstrapping in FHEW using GPUs
M. Lee, Yongje Lee, J. Cheon, Y. Paek
2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2015-07-27, pp. 128-135. DOI: 10.1109/ASAP.2015.7245720
GPU usage is no longer limited to graphics workloads; a wide variety of applications exploit the flexibility of GPUs to accelerate computation. One of the most notable emerging applications is fully homomorphic encryption (FHE), which enables arbitrary computation on encrypted data. Despite much research effort, FHE cannot yet be considered practical due to its enormous computational cost, especially in the bootstrapping procedure. In this paper, we accelerate the recently proposed fast bootstrapping method of the FHEW scheme using GPUs, as a case study of an FHE scheme. To guide the optimization, we examined the reference code and profiled it to find candidates for acceleration. Based on the profiling results, combined with a more flexible trade-off method, we optimized the bootstrapping algorithm in FHEW using the GPU and CUDA programming model. Empirical results show that bootstrapping of a FHEW ciphertext can be done in less than 0.11 seconds after optimization.
Comparative analysis of OpenCL vs. HDL with image-processing kernels on Stratix-V FPGA
K. Hill, S. Craciun, A. George, H. Lam
2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2015-07-27, pp. 189-193. DOI: 10.1109/ASAP.2015.7245733
Application development with hardware description languages (HDLs) such as VHDL or Verilog involves numerous productivity challenges, limiting the potential impact of reconfigurable computing (RC) with FPGAs in high-performance computing. Major challenges with HDL design include steep learning curves, large and complex codebases, long compilation times, and a lack of development standards across platforms. A relative newcomer to RC, the Open Computing Language (OpenCL) reduces these productivity hurdles by providing a platform-independent, C-based programming language. In this study, we conduct a performance and productivity comparison between three image-processing kernels (Canny edge detector, Sobel filter, and SURF feature extractor) developed using Altera's SDK for OpenCL and traditional VHDL. Our results show that the VHDL designs used resources more efficiently (59% to 70% less logic), while both the OpenCL and VHDL designs achieved similar operating frequencies (255 MHz < fmax < 325 MHz). Furthermore, we observed a 6× increase in productivity when using the OpenCL development tools, as well as the ability to port the same OpenCL designs unchanged to three different RC platforms with similar performance in terms of frequency and resource utilization.