
2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors: Latest Publications

Specialization of the Cell SPE for Media Applications
C. Meenderinck, B. Juurlink
There is a clear trend towards multi-cores to meet the performance requirements of emerging and future applications. A different way to scale performance is, however, to specialize the cores for specific application domains. This option is especially attractive for low-cost embedded systems where less silicon area directly translates to less cost. We propose architectural enhancements to specialize the Cell SPE for video decoding. Specifically, based on deficiencies we observed in the H.264 kernels, we propose a handful of application-specific instructions to improve performance. The speedups achieved are between 1.84 and 2.37.
Citations: 7
A Combined Decimal and Binary Floating-Point Multiplier
C. Tsen, S. González-Navarro, M. Schulte, Brian J. Hickmann, Katherine Compton
In this paper, we describe the first hardware design of a combined binary and decimal floating-point multiplier, based on the specifications in the IEEE 754-2008 Floating-Point Standard. The multiplier operates on either (1) 64-bit binary-encoded decimal floating-point (DFP) numbers or (2) 64-bit binary floating-point (BFP) numbers, and returns properly rounded results for the rounding modes specified in IEEE 754-2008. The design shares the following hardware resources between the two floating-point datatypes: a 54-bit by 54-bit binary multiplier, portions of the operand encoding/decoding, a 54-bit right shifter, exponent calculation logic, and rounding logic. Our synthesis results show that hardware sharing is feasible and has a reasonable impact on area, latency, and delay. The combined BFP and DFP multiplier occupies only 58% of the total area that would be required by separate BFP and DFP units. Furthermore, the critical path delay of the combined multiplier shows a negligible increase over a standalone DFP multiplier, without increasing the number of cycles to perform either BFP or DFP multiplication.
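As a quick sanity check on the shared datapath width (this sketch is illustrative, not part of the paper's design), the following verifies that a single 54 × 54-bit binary multiplier covers both formats: binary64 significands are 53 bits (52 stored plus the hidden bit), while decimal64 BID significands are integers up to 10^16 − 1, which occupy 54 bits.

```python
# Why one 54 x 54-bit binary multiplier serves both IEEE 754-2008 formats:
# binary64 significands are 53 bits; decimal64 (BID) significands are
# integers no larger than 10**16 - 1, which need 54 bits.
dfp_significand_max = 10**16 - 1           # largest 16-digit decimal significand

print(dfp_significand_max.bit_length())    # -> 54
print(dfp_significand_max < 2**54)         # -> True

# A product of two such significands fits in 54 + 54 = 108 bits, which the
# shared multiplier array and 54-bit right shifter must accommodate.
print((dfp_significand_max * dfp_significand_max).bit_length() <= 108)  # -> True
```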
Citations: 30
An Input Triggered Polymorphic ASIC for H.264 Decoding
Adarsha Rao, M. Alle, V. Sainath, Reyaz Shaik, Rajashekhar Chowhan, S. Sankaraiah, Sravanthi Mantha, S. Nandy, R. Narayan
This paper reports the design of an input-triggered polymorphic ASIC for an H.264 baseline decoder. Hardware polymorphism is achieved by selectively reusing hardware resources at the system and module levels. The complete design is done using ESL design tools, following a methodology that maintains consistency in testing and verification throughout the design flow. The proposed design supports frame sizes from QCIF to 1080p.
Citations: 3
Improving VLIW Processor Performance Using Three-Dimensional (3D) DRAM Stacking
Yangyang Pan, Tong Zhang
This work studies the potential of emerging 3D integration to improve embedded VLIW computing systems. We focus on the 3D integration of one VLIW processor die with multiple high-capacity DRAM dies. Our proposed memory architecture uses 3D stacking to bond a die containing several processing clusters to multiple DRAM dies serving as primary memory. The 3D technology also enables wide, low-latency buses between clusters and memory, and makes the latency of a 3D DRAM L2 cache comparable to that of a 2D SRAM L2 cache. This allows the 2D SRAM L2 cache to be replaced with a 3D DRAM L2 cache, and the die area freed up can be re-allocated to additional clusters that improve system performance. Simulation results show that a 3D stacked DRAM main memory improves system performance by 10% to 80% over a 2D off-chip DRAM main memory, depending on the benchmark. Also, for a similar logic die area, a four-cluster system with a 3D DRAM L2 cache and 3D DRAM main memory outperforms a two-cluster system with a 2D SRAM L2 cache and 3D DRAM main memory by about 10%.
Citations: 11
Division Unit for Binary Integer Decimals
T. Lang, A. Nannarelli
In this work, we present a radix-10 division unit that is based on the digit-recurrence algorithm and operates on binary-encoded (Binary Integer Decimal, or BID) significands. Recent decimal division designs are all based on the Binary Coded Decimal (BCD) encoding. We adapt the radix-10 digit-recurrence algorithm to the BID representation and implement the division unit in standard-cell technology. The implementation of the proposed BID division unit is compared to that of a BCD-based unit implementing the same algorithm. The comparison shows that, for normalized operands, the BID unit has the same latency as the BCD unit and a smaller area, but normalization is more expensive when implemented in BID.
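The essence of a radix-10 digit recurrence on binary machine integers (which is how BID stores significands) can be sketched as follows. This is a minimal restoring version producing one decimal quotient digit per iteration; the actual unit uses a redundant digit set and hardware digit selection, so this only illustrates the recurrence, not the paper's design.

```python
def radix10_digit_recurrence(x, y, n_digits):
    """Restoring radix-10 digit-recurrence division on plain binary
    integers: one decimal quotient digit per iteration. Computes the
    first n_digits fractional digits of x / y, assuming 0 <= x < y.
    """
    assert 0 <= x < y
    digits, r = [], x
    for _ in range(n_digits):
        r *= 10              # shift the partial remainder one decimal place
        d = r // y           # select the next quotient digit (0..9)
        r -= d * y           # restoring step: subtract d * y
        digits.append(d)
    return digits, r

# 1/8 = 0.1250
print(radix10_digit_recurrence(1, 8, 4)[0])   # -> [1, 2, 5, 0]
```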
Citations: 11
Scalar Processing Overhead on SIMD-Only Architectures
A. Azevedo, B. Juurlink
The Cell processor consists of a general-purpose core and eight cores with a complete SIMD instruction set. Although originally designed for multimedia and gaming, it is currently being used for a much broader range of applications. In this paper we evaluate whether the Cell SPEs could benefit significantly from a scalar processing unit, using two methodologies. In the first, the scalar processing overhead is eliminated by replacing all scalar data types with the quadword data type; this is feasible only for relatively small kernels. In the second, SPE performance is compared to that of a similarly configured PPU, which supports scalar operations. Experimental results show that the scalar processing overhead ranges from 19% to 57% for small kernels and from 12% to 39% for large kernels. Solutions to eliminate this overhead are also discussed.
Citations: 9
NeMo: A Platform for Neural Modelling of Spiking Neurons Using GPUs
A. Fidjeland, E. Roesch, M. Shanahan, W. Luk
Simulating spiking neural networks is of great interest to scientists wanting to model the functioning of the brain. However, large-scale models are expensive to simulate due to the number and interconnectedness of neurons in the brain. Furthermore, where such simulations are used in an embodied setting, the simulation must be real-time in order to be useful. In this paper we present NeMo, a platform for such simulations which achieves high performance through the use of highly parallel commodity hardware in the form of graphics processing units (GPUs). NeMo makes use of the Izhikevich neuron model which provides a range of realistic spiking dynamics while being computationally efficient. Our GPU kernel can deliver up to 400 million spikes per second. This corresponds to a real-time simulation of around 40 000 neurons under biologically plausible conditions with 1000 synapses per neuron and a mean firing rate of 10 Hz.
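The Izhikevich model NeMo uses is a two-variable system: v' = 0.04v^2 + 5v + 140 - u + I and u' = a(bv - u), with reset v <- c, u <- u + d when v reaches 30 mV. A minimal single-neuron sketch (standard regular-spiking parameters from Izhikevich's published model, not values taken from this paper):

```python
def izhikevich(I, T_ms, a=0.02, b=0.2, c=-65.0, d=8.0):
    """Simulate one Izhikevich neuron for T_ms milliseconds with constant
    input current I; returns the spike times in ms. Default parameters are
    the standard regular-spiking values.
    """
    v, u = c, b * c
    spikes = []
    for t in range(T_ms):
        for _ in range(2):   # two 0.5 ms substeps for numerical stability
            v += 0.5 * (0.04 * v * v + 5.0 * v + 140.0 - u + I)
        u += a * (b * v - u)
        if v >= 30.0:        # spike: record time, then reset
            spikes.append(t)
            v, u = c, u + d
    return spikes

print(len(izhikevich(I=10.0, T_ms=1000)) > 0)   # the neuron fires repeatedly
```

The quoted throughput is consistent with the operating point: 40,000 neurons × 1,000 synapses per neuron × 10 Hz mean firing rate gives 4 × 10^8 synaptic events per second, matching the 400 million spikes per second figure.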
Citations: 105
Filtering Global History: Power and Performance Efficient Branch Predictor
R. Ayoub, A. Orailoglu
In this paper we present an Application Customizable Branch Predictor (ACBP) that delivers energy savings and performance without compromising prediction accuracy. The idea is to filter unnecessary global history information out of the global history register, minimizing the predictor size while maintaining prediction accuracy. We propose an efficient algorithm to capture the beneficial correlations, and present a cost-efficient, programmable hardware architecture. Extensive experimental analysis confirms significant improvements in power savings and latency, of up to 84% and 30%, respectively.
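The filtering idea can be illustrated with a gshare-style predictor whose index uses only the history bits selected by a mask, so uncorrelated bits never inflate the table. Everything below (class name, sizing, the mask value) is an illustrative sketch, not the ACBP hardware.

```python
class FilteredHistoryPredictor:
    """Gshare-style predictor indexing with a masked (filtered) global
    history: only history bits selected by `mask` reach the index, so a
    smaller table can keep its accuracy. Illustrative sketch only.
    """
    def __init__(self, index_bits, mask):
        self.size = 1 << index_bits
        self.mask = mask
        self.history = 0
        self.table = [1] * self.size              # 2-bit saturating counters

    def _index(self, pc):
        filtered = self.history & self.mask       # keep only useful bits
        return (pc ^ filtered) & (self.size - 1)

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2   # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & 0xFFFF

# An always-taken branch is learned after a few updates.
p = FilteredHistoryPredictor(index_bits=10, mask=0b1111)
for _ in range(8):
    p.update(pc=0x400123, taken=True)
print(p.predict(0x400123))   # -> True
```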
Citations: 1
Acceleration of Multiresolution Imaging Algorithms: A Comparative Study
Richard Membarth, Philipp Kutzer, H. Dutta, Frank Hannig, J. Teich
In this paper we consider a multiresolution filter and its realization on the Cell BE and GPUs. We present both common and architecture-specific optimization strategies for obtaining maximum performance on these platforms, and show how speedups of 6.57x and 33.24x are obtained compared to an optimized OpenMP baseline implementation. Furthermore, we undertake automated configuration-space exploration of different partitioning possibilities to select the best tiling parameters.
Citations: 3
Parallel Prefix Ling Structures for Modulo 2^n-1 Addition
Jun Chen, J. Stine
Parallel-prefix adders draw significant attention in general-purpose and application-specific architectures because of their logarithmic delay and efficient VLSI implementation. This paper proposes a scheme to enhance parallel-prefix adders for modulo 2^n - 1 addition by incorporating Ling equations into parallel-prefix structures. In contrast to previous research, this work clarifies the use of Ling equations for modulo addition and provides enhancements to its implementation. Results are given for a placed-and-routed design in a variation-aware 45 nm technology. The implementation results show a significant improvement in delay and even a reduction in power dissipation.
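The arithmetic these adders implement rests on the identity 2^n ≡ 1 (mod 2^n - 1): the carry-out of an n-bit addition is simply fed back into the low-order position (an end-around carry), which a parallel-prefix network computes by recirculating the carry. A small sketch of that identity, assuming the hardware detail (Ling recoding, prefix structure) is abstracted away:

```python
def add_mod_2n_minus_1(a, b, n):
    """Addition modulo 2^n - 1 via an end-around carry: add, then fold the
    carry-out back into the low bits, using 2^n = 1 (mod 2^n - 1). For
    a, b <= 2^n - 1 a single fold suffices; the all-ones value 2^n - 1 is
    the second (one's-complement) representation of zero.
    """
    mask = (1 << n) - 1
    s = a + b
    return (s & mask) + (s >> n)   # re-add the carry-out

print(add_mod_2n_minus_1(9, 10, 4))   # 19 mod 15 -> 4
print(add_mod_2n_minus_1(8, 8, 4))    # 16 mod 15 -> 1
```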
Citations: 15