2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP): Latest Publications

Power and performance trade-offs for Space Time Adaptive Processing

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Pub Date : 2015-07-27 DOI: 10.1109/ASAP.2015.7245703

Nitin Gawande, J. Manzano, Antonino Tumeo, Nathan R. Tallent, D. Kerbyson, A. Hoisie

Power efficiency, defined as performance relative to power, is one of the most important concerns when designing RADAR processing systems. This paper analyzes power and performance trade-offs for a typical Space Time Adaptive Processing (STAP) application. We study CUDA and OpenMP implementations of STAP on two architectures: an Intel Haswell Core i7-4770TE and an NVIDIA Kayla platform with a GK208 GPU. We analyze the power and performance of STAP's computationally intensive kernels across the two hardware testbeds. We discuss an efficient parallel implementation for the Haswell CPU architecture. We also show the impact and trade-offs of GPU optimization techniques. The GPU architecture is able to process large data sets without an increase in power requirement. The use of shared memory has a significant impact on the GPU's power requirement. Finally, we show that balancing the use of shared memory against main memory accesses improves performance in a typical STAP application.

Citations: 4

A soft-core processor array for relational operators

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Pub Date : 2015-07-27 DOI: 10.1109/ASAP.2015.7245699

R. Polig, Heiner Giefers, W. Stechele

Despite the performance and power efficiency gains achieved by FPGAs for text analytics queries, analysis shows a low utilization of the custom hardware operator modules. Furthermore, the long synthesis times limit the accelerator's use in enterprise systems to static queries. To overcome these limitations, we propose the use of an overlay architecture to share area resources among multiple operators and reduce compilation times. In this paper we present a novel soft-core architecture tailored to efficiently perform relational operations of text analytics queries on multiple virtual streams. It combines efficient streaming-based operations with the flexibility of an instruction-programmable core. It is used as a processing element in an array of cores to execute large query graphs and has access to shared co-processors to perform string- and context-based operations. We evaluate the core architecture in terms of area and performance compared to the custom hardware modules, and show how a minimum number of cores can be calculated to avoid stalling the document processing.

Pages: 17-24

Citations: 4

An FPGA implementation of a Restricted Boltzmann Machine classifier using stochastic bit streams

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Pub Date : 2015-07-27 DOI: 10.1109/ASAP.2015.7245709

Bingzhe Li, M. Najafi, D. Lilja

Artificial neural networks (ANNs) usually require a very large number of computation nodes and can be implemented either in software or directly in hardware, such as FPGAs. Software-based approaches are offline and not suitable for real-time applications, but they support a large number of nodes. FPGA-based implementations, in contrast, can greatly speed up the computation. However, resource limitations in an FPGA restrict the maximum number of computation nodes in hardware-based approaches. This work exploits stochastic bit streams to implement the Restricted Boltzmann Machine (RBM) handwritten digit recognition application completely on an FPGA. This approach saves a large number of hardware resources, making the FPGA-based implementation of large ANNs feasible.
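
As a rough illustration of why stochastic bit streams save hardware, the sketch below encodes a value in [0, 1] as the fraction of ones in a random bit stream. Multiplying two such values then needs only a bitwise AND of two independent streams, at the cost of some random error. This is a generic stochastic-computing example, not the authors' RBM design; the stream length, the seed, and all function names are assumptions made for illustration.

```python
import random

def to_stream(p, length, rng):
    """Encode probability p in [0, 1] as a list of random bits."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def from_stream(bits):
    """Decode a stream: the value is the fraction of ones."""
    return sum(bits) / len(bits)

def stochastic_multiply(a, b, length=4096, seed=0):
    """Approximate a * b with one AND operation per bit position."""
    rng = random.Random(seed)
    stream_a = to_stream(a, length, rng)
    stream_b = to_stream(b, length, rng)  # drawn independently of stream_a
    return from_stream([x & y for x, y in zip(stream_a, stream_b)])

# The estimate approaches 0.5 * 0.8 = 0.4 as the stream gets longer.
estimate = stochastic_multiply(0.5, 0.8)
```

Longer streams reduce the error but take more cycles, which is the usual accuracy-versus-latency trade-off of this representation.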

Citations: 27

Loop coarsening in C-based High-Level Synthesis

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Pub Date : 2015-07-27 DOI: 10.1109/ASAP.2015.7245730

Moritz Schmid, Oliver Reiche, Frank Hannig, J. Teich

While current tools for High-Level Synthesis (HLS) excel at exploiting Instruction-Level Parallelism (ILP), their support for Data-Level Parallelism (DLP), one of the key advantages of Field Programmable Gate Arrays (FPGAs), is in contrast very limited. This work examines the exploitation of DLP on FPGAs using code generation for C-based HLS of image filters and streaming pipelines consisting of point and local operators. In addition to well-known loop tiling techniques, we propose loop coarsening, which delivers superior performance and scalability. Loop tiling corresponds to splitting an image into separate regions, which are then processed in parallel by replicated accelerators. For data streaming, this also requires the generation of glue logic for the distribution of image data. Conversely, loop coarsening allows multiple pixels to be processed in parallel, whereby only the kernel operator is replicated within a single accelerator. We augment the FPGA back end of the heterogeneous Domain-Specific Language (DSL) framework HIPAcc with loop coarsening and compare the resulting FPGA accelerators to highly optimized software implementations for Graphics Processing Units (GPUs), all generated from the exact same code base. Moreover, we demonstrate the advantages of code generation for algorithm development by outlining how design space exploration enabled by HIPAcc can yield a more efficient implementation than hand-coded VHDL.
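
The difference between the two transformations can be sketched in a few lines. The example below is a software analogy only, not HIPAcc output: `point_op` is a hypothetical point operator, and plain Python loops stand in for what would be parallel hardware. Tiling splits the image into regions handled by separate accelerator copies, while coarsening keeps one pass over the image and handles several pixels per loop iteration.

```python
def point_op(pixel):
    """Hypothetical point operator: invert an 8-bit pixel."""
    return 255 - pixel

def tiled(image, tiles=2):
    """Loop tiling: split the image into row regions, one accelerator copy each."""
    rows_per_tile = (len(image) + tiles - 1) // tiles
    out = []
    for t in range(tiles):  # in hardware, each tile would run in parallel
        region = image[t * rows_per_tile:(t + 1) * rows_per_tile]
        out.extend([[point_op(p) for p in row] for row in region])
    return out

def coarsened(image, factor=2):
    """Loop coarsening: one pass over the image, `factor` pixels per iteration."""
    out = []
    for row in image:
        new_row = []
        for x in range(0, len(row), factor):  # coarsened inner loop
            # only the operator is replicated `factor` times in one accelerator
            new_row.extend(point_op(p) for p in row[x:x + factor])
        out.append(new_row)
    return out

image = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
same_result = tiled(image) == coarsened(image)
```

Both variants compute the same image; they differ in what gets replicated, which is where the abstract's point about glue logic for data distribution comes from.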

Pages: 166-173

Citations: 14

Dynamic pipeline-partitioned video decoding on symmetric stream multiprocessors

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Pub Date : 2015-07-27 DOI: 10.1109/ASAP.2015.7245716

Ming-Ju Wu, Yan-Ting Chen, Chun-Jen Tsai

In this paper, we implement a dynamic pipeline-partitioning video decoder for the symmetric stream multiprocessor (SSMP) architecture. The SSMP architecture extends the traditional symmetric multiprocessor (SMP) architecture with dedicated per-core scratchpad memories and inter-processor communication (IPC) controllers for efficient data passing between the processor cores. The SSMP architecture allows the processor cores to cooperate efficiently in a fine-grained software pipeline fashion. A traditional software pipelined video decoder has fixed pipeline-stage partitions. The AVC/H.264 video decoder investigated in this paper dynamically assigns different stages of the video macroblock (MB) decoding tasks to different processor cores in order to maintain load balance among the processor cores. The pipeline partitioning policy is based on the queue levels of the inter-stage buffers. Experimental results show that, on average, the proposed dynamic pipeline-partitioning video decoder is 34% faster than a wavefront-based parallel video decoder.
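
A queue-level-driven partitioning policy could look roughly like the sketch below for two cores. This is an illustrative guess at the general idea, not the policy from the paper: the watermarks `low` and `high`, the function name, and the restriction to a single stage boundary are all assumptions.

```python
def rebalance(boundary, num_stages, queue_level, low=2, high=6):
    """Return the new index of the first stage run by the downstream core.

    Stages [0, boundary) run on the upstream core and stages
    [boundary, num_stages) on the downstream core. `queue_level` is the
    current fill of the buffer between the two cores.
    """
    if queue_level >= high and boundary < num_stages - 1:
        return boundary + 1  # buffer filling up: downstream core is overloaded
    if queue_level <= low and boundary > 1:
        return boundary - 1  # buffer draining: upstream core is overloaded
    return boundary          # within the watermarks: keep the partition

# With 4 decoding stages split 2/2, a nearly full buffer shifts one stage upstream.
new_boundary = rebalance(boundary=2, num_stages=4, queue_level=7)
```

The two watermarks give some hysteresis, so the partition does not flip on every small change in buffer fill.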

Citations: 0

Automatic frame rate-based DVFS of game

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Pub Date : 2015-07-27 DOI: 10.1109/ASAP.2015.7245726

Zhinan Cheng, Xi Li, Beilei Sun, Ce Gao, Jiachen Song

The rapid development of mobile games highlights the power consumption problem on mobile platforms. Most power saving techniques use a prediction-based dynamic voltage frequency scaling (DVFS) scheme. However, the prediction can be inaccurate because of the frequent user interactions during game play. We have observed that frame rate is near-linear in CPU frequency up to a bottleneck: once the CPU frequency reaches this threshold, the frame rate no longer increases with CPU frequency. Moreover, previous research has shown that utilizing game state information can reduce the influence of a game's interactive characteristics on the DVFS policy. We explore a method to automatically detect the game state. We propose the Automatic Frame Rate-Based DVFS policy, which can learn the frame rate threshold online and utilize the game state and frame rate information to scale the frequency without prediction. Our evaluation shows that, compared with the prediction-based Android default Interactive DVFS policy, our policy saves more power in all the tested games. Up to 15.2% more power can be saved by the Automatic Frame Rate-Based DVFS policy.
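
The saturation behaviour described above suggests a simple rule: run at the lowest frequency that already delivers (almost) the best frame rate. The sketch below illustrates that rule only; it is not the authors' online learning algorithm, and the measurements, the 2% tolerance, and the function name are made up for the example.

```python
def lowest_saturating_frequency(samples, tolerance=0.02):
    """Return the lowest frequency whose frame rate is within `tolerance`
    (relative) of the best observed frame rate.

    `samples` maps CPU frequency in MHz to measured frames per second.
    """
    best_fps = max(samples.values())
    for freq in sorted(samples):
        if samples[freq] >= (1.0 - tolerance) * best_fps:
            return freq

# Made-up measurements: near-linear growth, then a plateau at 60 fps.
measured = {300: 15.0, 600: 30.0, 900: 45.0, 1200: 59.5, 1500: 60.0, 1800: 60.0}
threshold = lowest_saturating_frequency(measured)
```

Running above the returned frequency would cost power without a visible gain in frame rate, which is the opportunity the policy exploits.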

Pages: 158-159

Citations: 6

Efficient implementation of structured long block-length LDPC codes

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Pub Date : 2015-07-27 DOI: 10.1109/ASAP.2015.7245739

A. J. Wong, S. Hemati, W. Gross

High-speed and low-area decoders for low-density parity-check (LDPC) codes with very long block lengths are challenging to implement due to the large number of nodes and edges required. In this paper we implement a decoder for a (32643, 30592) LDPC code that has variable nodes of degree 7, check nodes of degrees 111 and 112, and 228501 edges, making a fully-parallel hardware implementation infeasible. We analyze the structure of this code and describe a method of replacing the complex interconnect with a local, area-efficient version. We develop a modular architecture resulting in a low-complexity partially-parallel decoder based on the offset min-sum algorithm. The proposed decoder is shown to achieve a minimum gain of 92% in area utilization, compared to an extremely optimistic area estimation of the fully-parallel decoder that neglects the interconnection overhead. Synthesis in 65 nm CMOS results in a clock frequency of 370 MHz and a throughput of 24 Gbps with an area of 7.99 mm².
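
The code parameters quoted above are mutually consistent: 32643 variable nodes of degree 7 give 32643 × 7 = 228501 edges. The sketch below checks that arithmetic and then shows a textbook check-node update for the offset min-sum algorithm. It is a generic software formulation, not the decoder architecture from the paper; the offset value 0.5, the function name, and the assumption that the parity-check matrix has exactly n − k rows are illustrative.

```python
n, k = 32643, 30592
edges = 7 * n                   # every variable node has degree 7
checks = n - k                  # assumes a full-rank parity-check matrix
deg112 = edges - 111 * checks   # check nodes that carry one extra edge

def offset_min_sum_check_update(incoming, offset=0.5):
    """Compute the outgoing check-to-variable messages for one check node.

    For each edge, all *other* incoming LLRs are combined: the sign is the
    product of their signs and the magnitude is
    max(min(|others|) - offset, 0).
    """
    out = []
    for i in range(len(incoming)):
        others = incoming[:i] + incoming[i + 1:]
        sign = 1.0
        for value in others:
            if value < 0:
                sign = -sign
        magnitude = max(min(abs(v) for v in others) - offset, 0.0)
        out.append(sign * magnitude)
    return out

messages = offset_min_sum_check_update([2.0, -1.0, 3.0])
```

Under the full-rank assumption, the split works out to 840 check nodes of degree 112 and 1211 of degree 111. A hardware decoder would typically track only the two smallest magnitudes per check node rather than recomputing the minimum for every edge as this sketch does.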

Citations: 3

Dual-rail active protection system against side-channel analysis in FPGAs

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Pub Date : 2015-07-27 DOI: 10.1109/ASAP.2015.7245707

W. He, Dirmanto Jap

Cryptographic modules implemented in hardware are severely vulnerable to Side-Channel Attacks (SCA), which can retrieve secret information by observing the pattern or quantity of unintentional information leakage. Dual-rail Precharge Logic (DPL) theoretically thwarts side-channel analysis through low-level compensation, but its security can only be achieved at a high resource cost and with degraded performance. In this paper, we present a dynamic protection system that selectively reconfigures the security-sensitive crypto modules to an SCA-resistant dual-rail style when a real-time threat is detected. This threat-response mechanism helps to dynamically balance security and cost. The system is driven by a set of automated dual-rail conversion APIs for partially transforming the cryptographic module into its dual-rail format, in particular into a highly secure symmetric and interleaved placement. The elevated security grade from the safe mode to the threat mode is validated by EM-based mutual information analysis using a fine-grained surface scan of a decapsulated Virtex-5 FPGA on a SASEBO-GII board.
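
The compensation idea behind dual-rail precharge logic can be shown with a toy model. In the sketch below, each bit is carried on a true rail and a false rail, and every cycle starts from a precharged (0, 0) state, so exactly one rail rises per cycle whatever the data value. This is an idealized illustration, not the conversion APIs or placement scheme from the paper; the function names are hypothetical, and real leakage also depends on routing and timing imbalances that the model ignores.

```python
def dual_rail(bits):
    """Encode each bit as a (true_rail, false_rail) pair."""
    return [(b, 1 - b) for b in bits]

def rising_transitions(encoded):
    """Count 0->1 transitions per cycle, assuming a (0, 0) precharge first."""
    return [t + f for t, f in encoded]

secret = [0, 1, 1, 0, 1]
single_rail_activity = secret                              # varies with the data
dual_rail_activity = rising_transitions(dual_rail(secret))  # constant per cycle
```

In this model the single-rail activity mirrors the secret bits, while the dual-rail activity is the same for every cycle and therefore carries no information about them.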

Pages: 64-65

Citations: 1

Range reduction based on Pythagorean triples for trigonometric function evaluation

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Pub Date : 2015-07-27 DOI: 10.1109/ASAP.2015.7245712

Hugues de Lassus Saint-Geniès, D. Defour, G. Revy

Software evaluation of elementary functions usually requires three steps: a range reduction, a polynomial evaluation, and a reconstruction step. These evaluation schemes are designed to give the best performance for a given accuracy, which requires a fine control of errors. One of the main issues is to minimize the number of sources of error and/or their influence on the final result. The work presented in this article addresses this problem as it removes one source of error for the evaluation of trigonometric functions. We propose a method that eliminates rounding errors from tabulated values used in the second range reduction for the sine and cosine evaluation. When targeting correct rounding, we show that such tables are smaller and make the reconstruction step less expensive than existing methods. This approach relies on Pythagorean triples generators. Finally, we show how to generate tables indexed by up to 10 bits in a reasonable time and with little memory consumption.
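
One way to see why Pythagorean triples help: if a² + b² = c² for integers a, b, c, then a/c and b/c are the cosine and sine of the same angle, and the integers a and b can be tabulated without any rounding. The sketch below generates triples with Euclid's formula and picks the one whose angle is nearest to a target. It is only an illustration of that idea under these assumptions, not the generators or the selection procedure used in the paper; the search limit and function names are invented for the example.

```python
import math

def euclid_triples(limit):
    """Yield (a, b, c) with a*a + b*b == c*c from Euclid's formula."""
    for m in range(2, limit):
        for n in range(1, m):
            yield (m * m - n * n, 2 * m * n, m * m + n * n)

def closest_triple(angle, limit=64):
    """Pick the triple whose angle atan2(b, a) is nearest to `angle` (radians)."""
    return min(euclid_triples(limit),
               key=lambda t: abs(math.atan2(t[1], t[0]) - angle))

a, b, c = closest_triple(0.5)
# cos(theta) = a / c and sin(theta) = b / c for an angle theta close to 0.5,
# with a and b stored as exact integers
```

The table angle is then only close to the nominal grid point rather than equal to it, so a scheme built on this idea has to account for that small offset elsewhere in the range reduction.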

Citations: 5

Automatic design of domain-specific instructions for low-power processors

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Pub Date : 2015-07-27 DOI: 10.1109/ASAP.2015.7245697

Cecilia González-Alvarez, Jennifer B. Sartor, C. Álvarez, Daniel Jiménez-González, L. Eeckhout

This paper explores hardware specialization of low-power processors to improve performance and energy efficiency. Our main contribution is an automated framework that analyzes instruction sequences of applications within a domain at the loop body level and identifies exactly and partially matching sequences across applications that can become custom instructions. Our framework transforms sequences into a new code abstraction, a Merging Diagram, that improves similarity identification; clusters similar groups of potential custom instructions to effectively reduce the search space; and selects merged custom instructions to efficiently exploit the available customizable area. For a set of 11 media applications, our fast framework generates instructions that significantly improve the energy-delay product and speed-up, achieving more than double the savings of a technique that analyzes sequences within basic blocks. This paper shows that partially-matched custom instructions, which do not significantly increase design time, are crucial to achieving higher energy efficiency with limited hardware area.

Pages: 1-8

Citations: 8
