2010 IEEE 8th Symposium on Application Specific Processors (SASP): Latest Papers

A novel configuration circuit architecture to speedup reconfiguration and relocation for partially reconfigurable devices
Pub Date : 2010-06-13 DOI: 10.1109/SASP.2010.5521140
T. Marconi, J. Hur, K. Bertels, G. Gaydadjiev
Long reconfiguration times are a major bottleneck in dynamically reconfigurable systems. Many approaches have been proposed to address this problem. However, improvements to the configuration circuit that introduces this overhead are usually not considered. The high reconfiguration times are due to the large number of configuration bits sent through a constrained data path. To alleviate this, we propose a novel FPGA configuration circuit architecture to speed up bitstream (re)configuration and relocation. Experimental results using the MCNC benchmark set indicate that our proposal reconfigures 4 times faster and relocates 19.8 times more efficiently than state-of-the-art approaches. This is achieved by transporting only the data required for the configuration in flight and by avoiding external communication while relocating. Moreover, the configuration bitstream sizes of the evaluated benchmarks are reduced by 65% on average. In addition, our proposal introduces negligible hardware and communication protocol overheads.
Citations: 8
A processing engine for GPS correlation
Pub Date : 2010-06-13 DOI: 10.1109/SASP.2010.5521149
A. El-Rayis, T. Arslan, A. Erdogan
The correlation process in direct sequence spread spectrum (DSSS) communication systems is key to successful signal reception. Implementing real-time correlation in digital signal processors is one of the key challenges in realizing positioning systems today; as a result, most realizations are based on either application specific integrated circuits (ASICs) or Field Programmable Gate Arrays (FPGAs). In this work we introduce a new correlation engine targeting performance-critical positioning based on the Global Positioning System (GPS). The processor is based on the Reconfigurable Instruction Cell Array (RICA) paradigm. GPS was chosen because of its extensive integration in handheld devices (e.g. mobile phones) together with rising energy consumption concerns. We have designed, programmed and implemented several time-domain correlator engines based on the RICA architectural paradigm. Various optimization techniques were applied to adapt the processor to the correlation algorithm and to achieve the best performance. 12- and 24-channel correlators are tested using the new processor architecture.
Citations: 2
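The time-domain correlation mentioned in the GPS correlation entry above can be illustrated with a short software model. This is not the RICA implementation from the paper; it is a plain Python sketch, and the random ±1 sequence, the 255-chip length, the shift of 77 chips, and the function names `correlate` and `acquire` are all assumptions made for illustration.

```python
import random


def correlate(received, replica, shift):
    """Circular correlation of `received` with `replica` delayed by `shift` chips."""
    n = len(replica)
    return sum(received[i] * replica[(i - shift) % n] for i in range(n))


def acquire(received, replica):
    """Try every code phase and return (best_shift, peak_value)."""
    n = len(replica)
    scores = [correlate(received, replica, s) for s in range(n)]
    best = max(range(n), key=lambda s: scores[s])
    return best, scores[best]


# Hypothetical spreading code: a random +/-1 sequence, not a real GPS code.
rng = random.Random(0)
code = [rng.choice((-1, 1)) for _ in range(255)]

# Simulated noise-free received signal: the code delayed by 77 chips.
true_shift = 77
received = [code[(i - true_shift) % len(code)] for i in range(len(code))]

best_shift, peak = acquire(received, code)
print(best_shift, peak)
```

Each candidate code phase costs one multiply-accumulate per chip, so a multi-channel receiver repeats this inner loop many times per channel, which is presumably why the workload suits a dedicated engine.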
Ultra low energy Domain Specific Instruction-set Processor for on-line surveillance
Pub Date : 2010-06-13 DOI: 10.1109/SASP.2010.5521151
D. Novo, A. Kritikakou, P. Raghavan, L. Perre, J. Huisken, F. Catthoor
Many signal processing applications demand highly energy-efficient, flexible implementations. In this paper, we propose a novel Domain Specific Instruction-set Processor (DSIP) architecture template which is tuned for deployment in the targeted domain of on-line surveillance. The architecture, when implemented using a 40-nm CMOS standard cell library, executes a representative test vehicle with an energy efficiency of nearly 900 MOPS/mW including instruction and data memories. This is about 20 times higher than a state-of-the-art low power DSP architecture and less than a factor of 2 below a heavily optimized ASIC realization for the same application benchmark.
Citations: 11
Processor accelerator for AES
Pub Date : 2010-06-13 DOI: 10.1109/SASP.2010.5521153
R. Lee, Yu-Yuan Chen
Software AES cipher performance is not fast enough for encryption to be incorporated ubiquitously for all computing needs. Furthermore, fast software implementations of AES that use table lookups are susceptible to software cache-based side-channel attacks, which leak the secret encryption key. To bridge the gap between software and hardware AES implementations, several Instruction Set Architecture (ISA) extensions have been proposed to speed up software AES programs, most notably the recent introduction of six AES-specific instructions for Intel microprocessors. However, algorithm-specific instructions are less desirable than general-purpose ones for microprocessors. In this paper, we propose an enhanced parallel table lookup instruction that achieves the fastest reported software AES encryption and decryption for general-purpose microprocessors, at 1.38 cycles/byte, a 1.45X speedup over the fastest prior work reported. Security is also improved because cache-based side-channel attacks are thwarted: all table lookups take the same amount of time. Furthermore, the new instructions can also be used to accelerate any function that can be accelerated through table lookup operations on one or multiple small tables.
Citations: 19
Next-generation consumer audio application specific embedded processor
Pub Date : 2010-06-13 DOI: 10.1109/SASP.2010.5521155
Ji Kong, Peilin Liu, Xianmin Chen, Jin Wang, Xingguang Pan, Jun Wang, He-D. Xiao, Zhenqi Wei, R. Ying
For next-generation audio applications, the dominant trends are much higher sample rates, larger word lengths and more audio channels for playback audio data. Traditional DSPs and embedded processors are inefficient for such applications because of their non-specific or limited computing capabilities as well as their on-chip memory architectures. In this paper, an embedded audio processor aimed at next-generation audio applications is proposed. The audio-specific instruction set architecture is based on an analysis of the requirements of next-generation audio processing. In addition, a novel tightly coupled audio memory is proposed to support extremely high audio data throughput and flexible audio data transfers with main memories. To evaluate the performance of the proposed audio processor, a set of benchmarks based on the analysis of next-generation audio applications has been used. The implementation and evaluation results lead to the conclusion that the proposed audio processor is of outstanding efficiency and cost-effectiveness for next-generation audio applications.
Citations: 7
Minimizing write activities to non-volatile memory via scheduling and recomputation
Pub Date : 2010-06-13 DOI: 10.1109/SASP.2010.5521139
J. Hu, C. Xue, Wei-Che Tseng, Qingfeng Zhuge, E. Sha
Non-volatile memories, such as flash memory, Phase Change Memory (PCM), and Magnetic Random Access Memory (MRAM), have many characteristics that make them desirable as main memory for embedded DSP systems. These characteristics include low cost, shock resistance, non-volatility, power economy and high density. However, there are two common challenges we need to address before non-volatile memory can be used as main memory in practice. First, non-volatile memory has limited write/erase cycles compared to DRAM. Second, a write operation is slower than a read operation on non-volatile memory. Both challenges can be addressed by reducing the number of write activities on non-volatile main memory. In this paper, we propose two optimization techniques, write-aware scheduling and recomputation, to minimize write activities on non-volatile memory. With the proposed techniques, we can both shorten the completion time of programs and extend the lifetime of non-volatile memory. The experimental results show that the proposed techniques reduce the number of write activities on non-volatile memory by 55.71% on average. Thus, the lifetime of non-volatile memory is extended to 2.5 times as long as before on average. The completion time of programs is reduced on average by 55.32% on systems with NOR flash memory and by 40.69% on systems with NAND flash memory.
Citations: 48
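The link between fewer writes and a longer lifetime in the non-volatile memory entry above can be made concrete with a small calculation. The sketch assumes that lifetime is inversely proportional to the number of writes, which the abstract does not state, and the per-benchmark reductions in `reductions` are hypothetical values, not data from the paper. Under that assumption a uniform 55.71% reduction gives a factor of about 2.26, and an average of per-benchmark factors can be higher than the factor computed from the average reduction, which is one possible reading of the reported 2.5 times.

```python
def lifetime_factor(write_reduction):
    """Lifetime multiplier if lifetime is inversely proportional to write count.

    `write_reduction` is the fraction of writes removed (0 <= value < 1).
    """
    return 1.0 / (1.0 - write_reduction)


# Factor implied by applying the reported average reduction uniformly.
uniform_factor = lifetime_factor(0.5571)

# Hypothetical per-benchmark reductions (illustrative only).
reductions = [0.30, 0.55, 0.80]
mean_reduction = sum(reductions) / len(reductions)

factor_of_mean = lifetime_factor(mean_reduction)
mean_of_factors = sum(lifetime_factor(r) for r in reductions) / len(reductions)

print(round(uniform_factor, 2), round(factor_of_mean, 2), round(mean_of_factors, 2))
```

Because `1 / (1 - r)` is convex in `r`, the mean of the per-benchmark factors is never smaller than the factor of the mean reduction.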
Early performance-cost estimation of application-specific data path pipelining
Pub Date : 2010-06-13 DOI: 10.1109/SASP.2010.5521136
Jelena Trajkovic, D. Gajski
Application-specific processors (ASPs) are increasingly being adopted for optimized implementation of embedded systems. ASP design automation tools are, therefore, critical for meeting the time-to-market goals for ASP-based embedded systems. This paper targets the problem of determining the optimal data path pipeline configuration from a given application C code. We propose a technique for automatically estimating the application execution time on an ASP for various data path pipeline configurations based on estimated clock cycle length and estimated number of cycles. In addition, we compute the cost of each pipelined design, thereby characterizing the ASP by its performance and cost. Our estimation enables fast, accurate and early analysis of trade-offs between different data path pipeline configurations, without the need for creating either a prototype or a cycle-accurate model of the ASP. Our experimental results, based on industrial applications, demonstrate high fidelity for the performance estimation.
Citations: 0
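The estimation idea in the data path pipelining entry above, execution time as estimated clock cycle length times estimated cycle count, paired with a cost per configuration, can be sketched as follows. The configuration names, clock periods, cycle counts and cost figures are hypothetical, and the Pareto filter is an illustrative way to compare trade-offs rather than the selection method used in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    name: str
    clock_ns: float   # estimated clock cycle length
    cycles: int       # estimated number of cycles for the application
    cost: float       # estimated relative hardware cost

    @property
    def exec_time_ns(self):
        # Execution time = clock cycle length * number of cycles.
        return self.clock_ns * self.cycles


def pareto_front(configs):
    """Keep configurations not dominated in both execution time and cost."""
    front = []
    for c in configs:
        dominated = any(
            o.exec_time_ns <= c.exec_time_ns and o.cost <= c.cost
            and (o.exec_time_ns < c.exec_time_ns or o.cost < c.cost)
            for o in configs
        )
        if not dominated:
            front.append(c)
    return front


# Hypothetical estimates: deeper pipelines shorten the clock but add cycles and cost.
configs = [
    PipelineConfig("no_pipeline", 10.0, 1000, 1.0),
    PipelineConfig("2_stage", 6.0, 1200, 1.3),
    PipelineConfig("3_stage", 5.0, 1500, 1.6),
    PipelineConfig("4_stage", 3.0, 2000, 2.0),
]

front_names = [c.name for c in pareto_front(configs)]
print(front_names)
```

In this made-up data set the three-stage configuration is both slower and more expensive than the two-stage one, so it drops out of the trade-off curve.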
Efficient template matching with variable size templates in CUDA
Pub Date : 2010-06-13 DOI: 10.1109/SASP.2010.5521142
Nicholas Moore, M. Leeser, L. King
Graphics processing units (GPUs) offer significantly higher peak performance than CPUs, but for a limited problem space. Even within this space, GPU solutions are often restricted to a set of specific problem instances or offer greatly varying performance for slightly different parameters. This makes providing a library of GPU implementations that is adaptable to arbitrary inputs a difficult task. This research is motivated by a MATLAB lung tumor tracking application that relies on two-dimensional correlation and uses large template sizes. While GPU-based template matching has been addressed in the past, template sizes were limited to specific, relatively small sizes and not acceptable for accelerating the target application. This paper discusses a CUDA implementation that supports large template sizes and is adaptable to arbitrary template dimensions. The implementation uses on-demand compilation of kernels and compile-time expansion of various kernel parameters to improve the implementation adaptability without sacrificing performance.
Citations: 1
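As a reference for what template matching by two-dimensional correlation computes in the entry above, here is a minimal CPU sketch. It is not the CUDA implementation discussed in the paper: it uses zero-mean normalized cross-correlation as the score, which is an assumption about the scoring function, and the names `ncc_at` and `match_template` and the tiny test image are made up for illustration.

```python
import math


def ncc_at(image, template, top, left):
    """Zero-mean normalized cross-correlation of `template` with one image patch."""
    th, tw = len(template), len(template[0])
    patch = [image[top + r][left + c] for r in range(th) for c in range(tw)]
    tmpl = [v for row in template for v in row]
    p_mean = sum(patch) / len(patch)
    t_mean = sum(tmpl) / len(tmpl)
    num = sum((p - p_mean) * (t - t_mean) for p, t in zip(patch, tmpl))
    den = math.sqrt(sum((p - p_mean) ** 2 for p in patch)
                    * sum((t - t_mean) ** 2 for t in tmpl))
    # A flat patch or flat template has no defined correlation; score it as 0.
    return num / den if den > 0 else 0.0


def match_template(image, template):
    """Slide the template over every valid position; return (score, top, left)."""
    th, tw = len(template), len(template[0])
    best = (-2.0, 0, 0)  # scores lie in [-1, 1], so -2 is below any real score
    for top in range(len(image) - th + 1):
        for left in range(len(image[0]) - tw + 1):
            score = ncc_at(image, template, top, left)
            if score > best[0]:
                best = (score, top, left)
    return best


# Hypothetical 6x6 image with the template embedded at row 2, column 3.
template = [[1, 2], [3, 4]]
image = [[0] * 6 for _ in range(6)]
for r in range(2):
    for c in range(2):
        image[2 + r][3 + c] = template[r][c]

score, top, left = match_template(image, template)
print(score, top, left)
```

The work per position grows with the template area, which is why large templates, as in the tumor tracking application, are the expensive case and the loop bounds depend directly on the template dimensions.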
An RTOS in hardware for energy efficient software-based TCP/IP processing
Pub Date : 2010-06-13 DOI: 10.1109/SASP.2010.5521147
Naotaka Maruyama, T. Ishihara, H. Yasuura
Many functions of embedded systems are implemented in software so that frequent upgrades can be handled flexibly and unpredictable bugs in applications can be fixed quickly. However, this system architecture is generally less energy efficient than one implemented in application specific hardware. As a remedy for this issue, this paper proposes a processor-based platform with an RTOS in hardware for energy-efficient and flexible TCP/IP processing. Unlike application specific hardware, implementing the RTOS in hardware does not sacrifice the flexibility of the applications, while the energy efficiency is comparable to that of application specific hardware. Experiments with an actual TCP/IP application demonstrate that our approach achieves a 7 times improvement in energy efficiency over an existing commercial firmware RTOS.
Citations: 25
Reconfigurable custom functional unit generation and exploitation in multiple-issue processors
Pub Date : 2010-06-13 DOI: 10.1109/SASP.2010.5521135
I-Wei Wu, J. Shann, C. Chung
Next-generation digital entertainment and mobile communication devices are driving the demand for high-performance processing solutions. To meet this demand, many papers have proposed multiple-issue processors, such as very long instruction word (VLIW) architectures, augmented with a reconfigurable hardware accelerator. The reconfigurable hardware accelerator is usually realized by multiple functional units (FUs) organized in a matrix, called a reconfigurable customized functional unit (RCFU). Since a multiple-issue processor can execute several data-independent operations simultaneously, executing operations concurrently on both the RCFU and the FUs of the base processor is reasonable and also improves hardware resource utilization and execution performance. Based on this observation, we propose an RCFU generation algorithm and an RCFU exploitation algorithm in this paper. In our experiments, execution performance improves by a further 43% on average compared with previous work.
Citations: 0