Pub Date : 2015-07-27DOI: 10.1109/ASAP.2015.7245703
Nitin Gawande, J. Manzano, Antonino Tumeo, Nathan R. Tallent, D. Kerbyson, A. Hoisie
Power efficiency - performance relative to power - is one of the most important concerns when designing RADAR processing systems. This paper analyzes power and performance trade-offs for a typical Space Time Adaptive Processing (STAP) application. We study STAP implementations for CUDA and OpenMP on two architectures, Intel Haswell Core I7-4770TE and NVIDIA Kayla with a GK208 GPU. We analyze the power and performance of STAP's computationally intensive kernels across the two hardware testbeds. We discuss an efficient parallel implementation for the Haswell CPU architecture. We also show the impact and trade-offs of GPU optimization techniques. The GPU architecture is able to process large size data sets without increase in power requirement. The use of shared memory has a significant impact on the power requirement for the GPU. Finally, we show that a balance between the use of shared memory and main memory access leads to an improved performance in a typical STAP application.
{"title":"Power and performance trade-offs for Space Time Adaptive Processing","authors":"Nitin Gawande, J. Manzano, Antonino Tumeo, Nathan R. Tallent, D. Kerbyson, A. Hoisie","doi":"10.1109/ASAP.2015.7245703","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245703","url":null,"abstract":"Power efficiency - performance relative to power - is one of the most important concerns when designing RADAR processing systems. This paper analyzes power and performance trade-offs for a typical Space Time Adaptive Processing (STAP) application. We study STAP implementations for CUDA and OpenMP on two architectures, Intel Haswell Core I7-4770TE and NVIDIA Kayla with a GK208 GPU. We analyze the power and performance of STAP's computationally intensive kernels across the two hardware testbeds. We discuss an efficient parallel implementation for the Haswell CPU architecture. We also show the impact and trade-offs of GPU optimization techniques. The GPU architecture is able to process large size data sets without increase in power requirement. The use of shared memory has a significant impact on the power requirement for the GPU. Finally, we show that a balance between the use of shared memory and main memory access leads to an improved performance in a typical STAP application.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"46 1","pages":"41-48"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84695849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-07-27DOI: 10.1109/ASAP.2015.7245699
R. Polig, Heiner Giefers, W. Stechele
Despite the performance and power efficiency gains achieved by FPGAs for text analytics queries, analysis shows a low utilization of the custom hardware operator modules. Furthermore the long synthesis times limit the accelerator's use in enterprise systems to static queries. To overcome these limitations we propose the use of an overlay architecture to share area resources among multiple operators and reduce compilation times. In this paper we present a novel soft-core architecture tailored to efficiently perform relational operations of text analytics queries on multiple virtual streams. It combines the ability to perform efficient streaming based operations while adding the flexibility of an instruction programmable core. It is used as a processing element in an array of cores to execute large query graphs and has access to shared co-processors to perform string-and context-based operations. We evaluate the core architecture in terms of area and performance compared to the custom hardware modules, and show how a minimum number of cores can be calculated to avoid stalling the document processing.
{"title":"A soft-core processor array for relational operators","authors":"R. Polig, Heiner Giefers, W. Stechele","doi":"10.1109/ASAP.2015.7245699","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245699","url":null,"abstract":"Despite the performance and power efficiency gains achieved by FPGAs for text analytics queries, analysis shows a low utilization of the custom hardware operator modules. Furthermore the long synthesis times limit the accelerator's use in enterprise systems to static queries. To overcome these limitations we propose the use of an overlay architecture to share area resources among multiple operators and reduce compilation times. In this paper we present a novel soft-core architecture tailored to efficiently perform relational operations of text analytics queries on multiple virtual streams. It combines the ability to perform efficient streaming based operations while adding the flexibility of an instruction programmable core. It is used as a processing element in an array of cores to execute large query graphs and has access to shared co-processors to perform string-and context-based operations. We evaluate the core architecture in terms of area and performance compared to the custom hardware modules, and show how a minimum number of cores can be calculated to avoid stalling the document processing.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"10 1","pages":"17-24"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89859993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-07-27DOI: 10.1109/ASAP.2015.7245709
Bingzhe Li, M. Najafi, D. Lilja
Artificial neural networks (ANNs) usually require a very large number of computation nodes and can be implemented either in software or directly in hardware, such as FPGAs. Software-based approaches are offline and not suitable for real-time applications, but they support a large number of nodes. FPGA-based implementations, in contrast, can greatly speedup the computation time. However, resource limitations in an FPGA restrict the maximum number of computation nodes in hardware-based approaches. This work exploits stochastic bit streams to implement the Restricted Boltzmann Machine (RBM) handwritten digit recognition application completely on an FPGA. Exploiting this approach saves a large number of hardware resources making the FPGA-based implementation of large ANNs feasible.
{"title":"An FPGA implementation of a Restricted Boltzmann Machine classifier using stochastic bit streams","authors":"Bingzhe Li, M. Najafi, D. Lilja","doi":"10.1109/ASAP.2015.7245709","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245709","url":null,"abstract":"Artificial neural networks (ANNs) usually require a very large number of computation nodes and can be implemented either in software or directly in hardware, such as FPGAs. Software-based approaches are offline and not suitable for real-time applications, but they support a large number of nodes. FPGA-based implementations, in contrast, can greatly speedup the computation time. However, resource limitations in an FPGA restrict the maximum number of computation nodes in hardware-based approaches. This work exploits stochastic bit streams to implement the Restricted Boltzmann Machine (RBM) handwritten digit recognition application completely on an FPGA. Exploiting this approach saves a large number of hardware resources making the FPGA-based implementation of large ANNs feasible.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"21 1","pages":"68-69"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77722783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-07-27DOI: 10.1109/ASAP.2015.7245730
Moritz Schmid, Oliver Reiche, Frank Hannig, J. Teich
Current tools for High-Level Synthesis (HLS) excel at exploiting Instruction-Level Parallelism (ILP), the support for Data-Level Parallelism (DLP), one of the key advantages of Field Programmable Gate Arrays (FPGAs), is in contrast very limited. This work examines the exploitation of DLP on FPGAs using code generation for C-based HLS of image filters and streaming pipelines, consisting of point and local operators. In addition to well known loop tiling techniques, we propose loop coarsening, which delivers superior performance and scalability. Loop tiling corresponds to splitting an image into separate regions, which are then processed in parallel by replicated accelerators. For data streaming, this also requires the generation of glue logic for the distribution of image data. Conversely, loop coarsening allows to process multiple pixels in parallel, whereby only the kernel operator is replicated within a single accelerator. We augment the FPGA back end of the heterogeneous Domain-Specific Language (DSL) framework HIPAcc by loop coarsening and compare the resulting FPGA accelerators to highly optimized software implementations for Graphics Processing Units (GPUs), all generated from the exact same code base. Moreover, we demonstrate the advantages of code generation for algorithm development by outlining how design space exploration enabled by HIPAcc can yield a more efficient implementation than hand-coded VHDL.
{"title":"Loop coarsening in C-based High-Level Synthesis","authors":"Moritz Schmid, Oliver Reiche, Frank Hannig, J. Teich","doi":"10.1109/ASAP.2015.7245730","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245730","url":null,"abstract":"Current tools for High-Level Synthesis (HLS) excel at exploiting Instruction-Level Parallelism (ILP), the support for Data-Level Parallelism (DLP), one of the key advantages of Field Programmable Gate Arrays (FPGAs), is in contrast very limited. This work examines the exploitation of DLP on FPGAs using code generation for C-based HLS of image filters and streaming pipelines, consisting of point and local operators. In addition to well known loop tiling techniques, we propose loop coarsening, which delivers superior performance and scalability. Loop tiling corresponds to splitting an image into separate regions, which are then processed in parallel by replicated accelerators. For data streaming, this also requires the generation of glue logic for the distribution of image data. Conversely, loop coarsening allows to process multiple pixels in parallel, whereby only the kernel operator is replicated within a single accelerator. We augment the FPGA back end of the heterogeneous Domain-Specific Language (DSL) framework HIPAcc by loop coarsening and compare the resulting FPGA accelerators to highly optimized software implementations for Graphics Processing Units (GPUs), all generated from the exact same code base. Moreover, we demonstrate the advantages of code generation for algorithm development by outlining how design space exploration enabled by HIPAcc can yield a more efficient implementation than hand-coded VHDL.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"41 1","pages":"166-173"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85806849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-07-27DOI: 10.1109/ASAP.2015.7245716
Ming-Ju Wu, Yan-Ting Chen, Chun-Jen Tsai
In this paper, we have implemented a dynamic pipeline-partitioning video decoder for the symmetric stream multiprocessor (SSMP) architecture. The SSMP architecture extends the traditional symmetric multiprocessor (SMP) architecture with dedicated per-core scratchpad memories and inter-processor communication (IPC) controllers for efficient data passing between the processor cores. The SSMP architecture allows the processor cores to cooperate efficiently in a fine-grained software pipeline fashion. A traditional software pipelined video decoder has fixed pipeline-stage partitions. The AVC/H.264 video decoder investigated in this paper dynamically assigns different stages of the video macroblock (MB) decoding tasks to different processor cores in order to maintain load balance among the processor cores. The pipeline partitioning policy is based on the queue levels of the inter-stage buffers. Experimental results show that, on average, the proposed dynamic pipeline-partitioning video decoder is 34% faster compared to a wavefront-based parallel video decoder.
{"title":"Dynamic pipeline-partitioned video decoding on symmetric stream multiprocessors","authors":"Ming-Ju Wu, Yan-Ting Chen, Chun-Jen Tsai","doi":"10.1109/ASAP.2015.7245716","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245716","url":null,"abstract":"In this paper, we have implemented a dynamic pipeline-partitioning video decoder for the symmetric stream multiprocessor (SSMP) architecture. The SSMP architecture extends the traditional symmetric multiprocessor (SMP) architecture with dedicated per-core scratchpad memories and inter-processor communication (IPC) controllers for efficient data passing between the processor cores. The SSMP architecture allows the processor cores to cooperate efficiently in a fine-grained software pipeline fashion. A traditional software pipelined video decoder has fixed pipeline-stage partitions. The AVC/H.264 video decoder investigated in this paper dynamically assigns different stages of the video macroblock (MB) decoding tasks to different processor cores in order to maintain load balance among the processor cores. The pipeline partitioning policy is based on the queue levels of the inter-stage buffers. Experimental results show that, on average, the proposed dynamic pipeline-partitioning video decoder is 34% faster compared to a wavefront-based parallel video decoder.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"21 1","pages":"106-110"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75211891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-07-27DOI: 10.1109/ASAP.2015.7245726
Zhinan Cheng, Xi Li, Beilei Sun, Ce Gao, Jiachen Song
The rapid development of mobile games highlights the power consumption problem in the mobile platform. Most of the power saving techniques use the prediction-based dynamic voltage frequency scaling (DVFS) scheme. However, the prediction could be inaccurate resulting from the frequent interactions of user when playing games. We have observed that frame rate is near-linear to CPU frequency, but there is a bottleneck, frame rate will not increase as CPU frequency increases when CPU frequency reaches this threshold. Moreover, previous research has shown that utilizing the information of game state can reduce the influence of game interactive characterization to DVFS policy. We explore a method to automatically detect the game state. We propose the Automatic Frame Rate-Based DVFS policy, which can learn the threshold of frame rate online and utilize the information of game state and frame rate to scale the frequency without prediction. Our evaluation result shows that, compared with the prediction-based Android default Interactive DVFS policy, our policy saves more power in all the testing games. Up to 15.2% more power can be saved by Automatic Frame Rate-Based DVFS policy.
{"title":"Automatic frame rate-based DVFS of game","authors":"Zhinan Cheng, Xi Li, Beilei Sun, Ce Gao, Jiachen Song","doi":"10.1109/ASAP.2015.7245726","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245726","url":null,"abstract":"The rapid development of mobile games highlights the power consumption problem in the mobile platform. Most of the power saving techniques use the prediction-based dynamic voltage frequency scaling (DVFS) scheme. However, the prediction could be inaccurate resulting from the frequent interactions of user when playing games. We have observed that frame rate is near-linear to CPU frequency, but there is a bottleneck, frame rate will not increase as CPU frequency increases when CPU frequency reaches this threshold. Moreover, previous research has shown that utilizing the information of game state can reduce the influence of game interactive characterization to DVFS policy. We explore a method to automatically detect the game state. We propose the Automatic Frame Rate-Based DVFS policy, which can learn the threshold of frame rate online and utilize the information of game state and frame rate to scale the frequency without prediction. Our evaluation result shows that, compared with the prediction-based Android default Interactive DVFS policy, our policy saves more power in all the testing games. Up to 15.2% more power can be saved by Automatic Frame Rate-Based DVFS policy.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"393 1","pages":"158-159"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76430526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-07-27DOI: 10.1109/ASAP.2015.7245739
A. J. Wong, S. Hemati, W. Gross
High-speed and low-area decoders for low-density parity-check (LDPC) codes with very long block lengths are challenging to implement due to the large amount of nodes and edges required. In this paper we implement a decoder for a (32643, 30592) LDPC code that has variable nodes of degree 7, check nodes degrees of 111 and 112, and 228501 edges, making fully-parallel hardware implementation unfeasible. We analyze the structure of this code and describe a method of replacing the complex interconnect with a local, area-efficient version. We develop an modular architecture resulting in a low-complexity partially-parallel decoder architecture based on the offset min-sum algorithm. The proposed decoder is shown to achieve a minimum gain of 92% in area utilization, compared to an extremely optimistic area estimation of the fully-parallel decoder that neglects the interconnection overhead. Synthesis in 65 nm CMOS is performed resulting in a clock frequency of 370 MHz and a throughput of 24 Gbps with an area of 7.99 mm2.
{"title":"Efficient implementation of structured long block-length LDPC codes","authors":"A. J. Wong, S. Hemati, W. Gross","doi":"10.1109/ASAP.2015.7245739","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245739","url":null,"abstract":"High-speed and low-area decoders for low-density parity-check (LDPC) codes with very long block lengths are challenging to implement due to the large amount of nodes and edges required. In this paper we implement a decoder for a (32643, 30592) LDPC code that has variable nodes of degree 7, check nodes degrees of 111 and 112, and 228501 edges, making fully-parallel hardware implementation unfeasible. We analyze the structure of this code and describe a method of replacing the complex interconnect with a local, area-efficient version. We develop an modular architecture resulting in a low-complexity partially-parallel decoder architecture based on the offset min-sum algorithm. The proposed decoder is shown to achieve a minimum gain of 92% in area utilization, compared to an extremely optimistic area estimation of the fully-parallel decoder that neglects the interconnection overhead. Synthesis in 65 nm CMOS is performed resulting in a clock frequency of 370 MHz and a throughput of 24 Gbps with an area of 7.99 mm2.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"63 1 1","pages":"234-238"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85484438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-07-27DOI: 10.1109/ASAP.2015.7245707
W. He, Dirmanto Jap
The security of the implemented cryptographic module in hardware has seen severe vulnerabilities against Side-Channel Attack (SCA), which is capable of retrieving hidden things by observing the pattern or quantity of unintentional information leakage. Dual-rail Precharge Logic (DPL) theoretically thwarts side-channel analyses by its low-level compensation manner, while the security reliability of DPLs can only be achieved at high resource expenses and degraded performance. In this paper, we present a dynamic protection system for selectively configuring the security-sensitive crypto modules to SCA-resistant dual-rail style in the scenario that the real-time threat is detected. The threat-response mechanism helps to dynamically balance the security and cost. The system is driven by a set of automated dual-rail conversion APIs for partially transforming the cryptographic module into its dual-rail format, particularly to a highly secure symmetric and interleaved placement. The elevated security grade from the safe to threat mode is validated by EM based mutual information analysis using fine-grained surface scan to a decapsulated Virtex-5 FPGA on SASEBO GII board.
{"title":"Dual-rail active protection system against side-channel analysis in FPGAs","authors":"W. He, Dirmanto Jap","doi":"10.1109/ASAP.2015.7245707","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245707","url":null,"abstract":"The security of the implemented cryptographic module in hardware has seen severe vulnerabilities against Side-Channel Attack (SCA), which is capable of retrieving hidden things by observing the pattern or quantity of unintentional information leakage. Dual-rail Precharge Logic (DPL) theoretically thwarts side-channel analyses by its low-level compensation manner, while the security reliability of DPLs can only be achieved at high resource expenses and degraded performance. In this paper, we present a dynamic protection system for selectively configuring the security-sensitive crypto modules to SCA-resistant dual-rail style in the scenario that the real-time threat is detected. The threat-response mechanism helps to dynamically balance the security and cost. The system is driven by a set of automated dual-rail conversion APIs for partially transforming the cryptographic module into its dual-rail format, particularly to a highly secure symmetric and interleaved placement. The elevated security grade from the safe to threat mode is validated by EM based mutual information analysis using fine-grained surface scan to a decapsulated Virtex-5 FPGA on SASEBO GII board.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"32 1","pages":"64-65"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90593447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-07-27DOI: 10.1109/ASAP.2015.7245712
Hugues de Lassus Saint-Geniès, D. Defour, G. Revy
Software evaluation of elementary functions usually requires three steps: a range reduction, a polynomial evaluation, and a reconstruction step. These evaluation schemes are designed to give the best performance for a given accuracy, which requires a fine control of errors. One of the main issues is to minimize the number of sources of error and/or their influence on the final result. The work presented in this article addresses this problem as it removes one source of error for the evaluation of trigonometric functions. We propose a method that eliminates rounding errors from tabulated values used in the second range reduction for the sine and cosine evaluation. When targeting correct rounding, we show that such tables are smaller and make the reconstruction step less expensive than existing methods. This approach relies on Pythagorean triples generators. Finally, we show how to generate tables indexed by up to 10 bits in a reasonable time and with little memory consumption.
{"title":"Range reduction based on Pythagorean triples for trigonometric function evaluation","authors":"Hugues de Lassus Saint-Geniès, D. Defour, G. Revy","doi":"10.1109/ASAP.2015.7245712","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245712","url":null,"abstract":"Software evaluation of elementary functions usually requires three steps: a range reduction, a polynomial evaluation, and a reconstruction step. These evaluation schemes are designed to give the best performance for a given accuracy, which requires a fine control of errors. One of the main issues is to minimize the number of sources of error and/or their influence on the final result. The work presented in this article addresses this problem as it removes one source of error for the evaluation of trigonometric functions. We propose a method that eliminates rounding errors from tabulated values used in the second range reduction for the sine and cosine evaluation. When targeting correct rounding, we show that such tables are smaller and make the reconstruction step less expensive than existing methods. This approach relies on Pythagorean triples generators. Finally, we show how to generate tables indexed by up to 10 bits in a reasonable time and with little memory consumption.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"23 1","pages":"74-81"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83453715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2015-07-27DOI: 10.1109/ASAP.2015.7245697
Cecilia González-Alvarez, Jennifer B. Sartor, C. Álvarez, Daniel Jiménez-González, L. Eeckhout
This paper explores hardware specialization of low-power processors to improve performance and energy efficiency. Our main contribution is an automated framework that analyzes instruction sequences of applications within a domain at the loop body level and identifies exactly and partially-matching sequences across applications that can become custom instructions. Our framework transforms sequences to a new code abstraction, a Merging Diagram, that improves similarity identification, clusters alike groups of potential custom instructions to effectively reduce the search space, and selects merged custom instructions to efficiently exploit the available customizable area. For a set of 11 media applications, our fast framework generates instructions that significantly improve the energy-delay product and speed-up, achieving more than double the savings as compared to a technique analyzing sequences within basic blocks. This paper shows that partially-matched custom instructions, which do not significantly increase design time, are crucial to achieving higher energy efficiency at limited hardware areas.
{"title":"Automatic design of domain-specific instructions for low-power processors","authors":"Cecilia González-Alvarez, Jennifer B. Sartor, C. Álvarez, Daniel Jiménez-González, L. Eeckhout","doi":"10.1109/ASAP.2015.7245697","DOIUrl":"https://doi.org/10.1109/ASAP.2015.7245697","url":null,"abstract":"This paper explores hardware specialization of low-power processors to improve performance and energy efficiency. Our main contribution is an automated framework that analyzes instruction sequences of applications within a domain at the loop body level and identifies exactly and partially-matching sequences across applications that can become custom instructions. Our framework transforms sequences to a new code abstraction, a Merging Diagram, that improves similarity identification, clusters alike groups of potential custom instructions to effectively reduce the search space, and selects merged custom instructions to efficiently exploit the available customizable area. For a set of 11 media applications, our fast framework generates instructions that significantly improve the energy-delay product and speed-up, achieving more than double the savings as compared to a technique analyzing sequences within basic blocks. This paper shows that partially-matched custom instructions, which do not significantly increase design time, are crucial to achieving higher energy efficiency at limited hardware areas.","PeriodicalId":6642,"journal":{"name":"2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"146 1","pages":"1-8"},"PeriodicalIF":0.0,"publicationDate":"2015-07-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88091323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}