
2009 IEEE International Conference on Computer Design: Latest Publications

A new verification method for embedded systems
Pub Date: 2009-10-04 DOI: 10.1109/ICCD.2009.5413154
Robert A. Thacker, C. Myers, K. R. Jones, S. Little
Verification of embedded systems is complicated by the fact that they are composed of digital hardware, analog sensors and actuators, and low-level software. In order to verify the interaction of these heterogeneous components, it would be beneficial to have a single modeling formalism that is capable of representing all of these components. To address this need, this paper describes an extended labeled hybrid Petri net (LHPN) model that includes constructs for Boolean, discrete, and continuous variables as well as constructs to model timing. This paper also presents a method to verify these extended LHPNs. Finally, this paper presents a case study to illustrate the application of this model to the verification of a fault-tolerant temperature sensor.
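To make concrete what a single formalism spanning digital, analog, and software components has to represent, here is a minimal sketch of a data structure for an extended LHPN with Boolean, discrete, and continuous variables and timed transitions. The class and field names are illustrative assumptions, not the authors' tool or notation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class Transition:
    name: str
    lower_bound: float                      # earliest firing time
    upper_bound: float                      # latest firing time
    guard: str = "true"                     # Boolean condition over the variables
    assignments: Dict[str, str] = field(default_factory=dict)

@dataclass
class ExtendedLHPN:
    """Toy container for an extended labeled hybrid Petri net (illustrative only)."""
    places: Set[str]
    transitions: List[Transition]
    marking: Set[str]                                        # currently marked places
    booleans: Dict[str, bool] = field(default_factory=dict)  # digital control state
    discretes: Dict[str, int] = field(default_factory=dict)  # e.g. operating mode
    continuous: Dict[str, Tuple[float, float]] = field(default_factory=dict)  # (value, rate)

# One heterogeneous state: a digital control bit, a discrete mode, an analog temperature.
net = ExtendedLHPN(
    places={"idle", "sensing"},
    transitions=[Transition("start_sense", lower_bound=1.0, upper_bound=2.0, guard="~heater_on")],
    marking={"idle"},
    booleans={"heater_on": False},
    discretes={"mode": 0},
    continuous={"temp": (25.0, 0.5)},
)
print(len(net.transitions), "transition(s),", len(net.marking), "marked place(s)")
```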
Citations: 12
Efficient calibration of thermal models based on application behavior
Pub Date: 2009-10-04 DOI: 10.1109/ICCD.2009.5413179
Youngwoo Ahn, Inchoon Yeo, R. Bettati
With increasing power densities, rising operating temperatures in chips threaten system reliability. Thermal control has therefore emerged as an important issue in system design and management. For dynamic thermal control to be effective, predictive thermal models of the system are needed. Such models typically use power as input, which renders them difficult to use in practical systems, where power monitoring is not available at the processor or chip level. In this paper, we describe a methodology to infer the thermal model based on the monitoring of existing temperature sensors and of instruction counter registers. This allows the thermal model to be easily established, calibrated, and recalibrated at runtime to account for different thermal behavior due either to variations in fabrication or to varying environmental parameters. We validate the proposed methodology through a series of experiments. We also propose and validate an extension of the model and associated methodology for multicore processors.
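As a rough sketch of the calibration step, the example below fits a first-order discrete-time thermal model by least squares from logged temperature-sensor readings and per-interval instruction counts. The model form, coefficients, and sample data are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

# Logged at runtime: temperature samples T[k] (deg C) and instruction counts n[k]
# retired in interval k, both readable on commodity hardware.
T = np.array([52.0, 53.1, 54.0, 54.6, 54.9, 55.0, 54.7, 54.2])
n = np.array([2.1e8, 2.0e8, 1.9e8, 1.7e8, 1.2e8, 0.6e8, 0.3e8])

# Assumed model: T[k+1] = a*T[k] + b*n[k] + c  (first-order thermal RC behavior).
A = np.column_stack([T[:-1], n, np.ones_like(n)])
y = T[1:]
(a, b, c), *_ = np.linalg.lstsq(A, y, rcond=None)

# Recalibration at runtime is simply re-running the fit on a fresh window of samples.
T_pred = a * T[-1] + b * 0.5e8 + c
print(f"fitted a={a:.3f}, b={b:.3e}, c={c:.2f}; predicted next T={T_pred:.1f} C")
```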
Citations: 1
A power-aware hybrid RAM-CAM renaming mechanism for fast recovery
Pub Date: 2009-10-04 DOI: 10.1109/ICCD.2009.5413160
S. Petit, R. Ubal, J. Sahuquillo, P. López
Modern superscalar processors implement register renaming by using either RAM or CAM tables. The design of these structures should address their access time and misprediction recovery penalty. While direct-mapped RAMs provide faster access times, CAMs are more appropriate for avoiding recovery penalties. Although they are more complex and slower, CAMs usually match the processor cycle in current designs. However, they do not scale with the number of physical registers and the pipeline width. In this paper, we present a new hybrid RAM-CAM register renaming scheme, which combines the best of both approaches. In a steady state, a RAM provides the current mappings quickly; on misspeculation, a low-complexity CAM enables immediate recovery and further register renaming. Compared to an ideal CAM in a 4-way state-of-the-art superscalar microprocessor, and for almost the same performance (1% slowdown) and area (95% of the ideal CAM size), the proposed scheme consumes about 90% less dynamic energy.
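The following behavioral sketch contrasts the two roles: a directly indexed RAM-style table answers steady-state rename lookups, while per-branch associative state (standing in for the low-complexity CAM) is consulted only to undo mappings on misspeculation. It is a simplified software model under assumed semantics, not the paper's circuit.

```python
class HybridRenamer:
    """Toy model: RAM table for fast lookups, CAM-like checkpoints for recovery."""

    def __init__(self, num_logical, free_physical):
        self.ram = list(range(num_logical))   # logical -> physical (direct-indexed "RAM")
        self.free = list(free_physical)
        self.checkpoints = {}                 # branch tag -> list of (logical, old_physical)

    def take_checkpoint(self, tag):
        self.checkpoints[tag] = []

    def rename_dest(self, logical, open_tags):
        new_phys = self.free.pop(0)
        for tag in open_tags:                 # log the old mapping in every open checkpoint
            self.checkpoints[tag].append((logical, self.ram[logical]))
        self.ram[logical] = new_phys
        return new_phys

    def recover(self, tag):
        # "CAM" search: walk the entries tied to the mispredicted branch and
        # restore the previous physical registers, newest first.
        for logical, old_phys in reversed(self.checkpoints.pop(tag)):
            self.free.insert(0, self.ram[logical])
            self.ram[logical] = old_phys

r = HybridRenamer(num_logical=8, free_physical=range(8, 16))
r.take_checkpoint("br1")
r.rename_dest(3, open_tags=["br1"])
r.recover("br1")                              # mapping of r3 is restored immediately
assert r.ram[3] == 3
```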
Citations: 4
Multiplier-less and table-less linear approximation for square and square-root
Pub Date: 2009-10-04 DOI: 10.1109/ICCD.2009.5413129
I. Park, Tae-Hwan Kim
Square and square-root are widely used in digital signal processing and digital communication algorithms, and their efficient realizations are commonly required to reduce the hardware complexity. From an implementation point of view, approximate realizations are often desired if they do not degrade performance significantly. In this paper, we propose new linear approximations for the square and square-root functions. The traditional linear approximations need multipliers to calculate slope offsets and tables to store initial offset values and slope values, whereas the proposed approximations exploit the inherent properties of square-related functions to linearly interpolate with only simple operations, such as shift, concatenation and addition, which are usually supported in modern VLSI systems. More importantly, regardless of the bit-width of the number system, the maximum relative errors of the proposed approximations are bounded by 6.25% and 3.13% for the square and square-root functions, respectively.
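To show the flavor of a multiplier-less, table-less linear approximation, the sketch below interpolates x² between the bracketing powers of two using only shifts and adds. This naive segmentation is for intuition only; it is not the authors' exact scheme, and its worst-case error is looser than the 6.25% they report.

```python
def approx_square(x: int) -> int:
    """Linear interpolation of x*x on [2^k, 2^(k+1)) built from shift/add only:
    3*2^k*x - 2^(2k+1), where 3*x is computed as (x << 1) + x."""
    k = x.bit_length() - 1
    return (((x << 1) + x) << k) - (1 << (2 * k + 1))

# Exact at powers of two, worst near segment midpoints.
worst = max(abs(approx_square(x) - x * x) / (x * x) for x in range(1, 1 << 16))
print(f"max relative error of this naive segmentation: {worst:.3%}")
```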
Citations: 14
Code density concerns for new architectures
Pub Date: 2009-10-04 DOI: 10.1109/ICCD.2009.5413117
Vincent M. Weaver, S. Mckee
Reducing a program's instruction count can improve cache behavior and bandwidth utilization, lower power consumption, and increase overall performance. Nonetheless, code density is an often overlooked feature in studying processor architectures. We hand-optimize an assembly language embedded benchmark for size on 21 different instruction set architectures, finding up to a factor of three difference in code sizes from ISA alone. We find that the architectural features that contribute most heavily to code density are instruction length, number of registers, availability of a zero register, bit-width, hardware divide units, number of instruction operands, and the availability of unaligned loads and stores. We extend our results to investigate operating system, compiler, and system library effects on code density. We find that the executable starting address, executable format, and system call interface all affect program size. While ISA effects are important, the efficiency of the entire system stack must be taken into account when developing a new dense instruction set architecture.
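As a small illustration of how code density might be compared in practice, the sketch below reads the .text (code) section size of several pre-built binaries with the standard size utility. The binary paths are hypothetical placeholders, and cross-compiled builds of the same benchmark are assumed to exist already.

```python
import subprocess

# Hypothetical cross-compiled builds of the same benchmark, one per ISA.
binaries = {"x86_64": "./bench.x86_64", "thumb2": "./bench.thumb2", "riscv64": "./bench.riscv64"}

def text_size(path: str) -> int:
    # Berkeley-format `size` output: a header line, then "text data bss dec hex filename".
    out = subprocess.run(["size", path], capture_output=True, text=True, check=True)
    return int(out.stdout.splitlines()[1].split()[0])

sizes = {isa: text_size(p) for isa, p in binaries.items()}
smallest = min(sizes.values())
for isa, s in sorted(sizes.items(), key=lambda kv: kv[1]):
    print(f"{isa:8s} {s:8d} bytes  ({s / smallest:.2f}x of densest)")
```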
Citations: 21
A novel SoC architecture on FPGA for ultra fast face detection
Pub Date: 2009-10-04 DOI: 10.1109/ICCD.2009.5413122
Chunhui He, Alexandros Papakonstantinou, Deming Chen
Face detection is the cornerstone of a wide range of applications such as video surveillance, robotic vision and biometric authentication. One of the biggest challenges in face-detection-based applications is the speed at which faces can be accurately detected. In this paper, we present a novel SoC (System on Chip) architecture for ultra-fast face detection in video or other image-rich content. Our implementation is based on an efficient and robust algorithm that uses a cascade of Artificial Neural Network (ANN) classifiers on AdaBoost-trained Haar features. The face detector architecture extracts coarse-grained parallelism by efficiently overlapping different computation phases while taking advantage of fine-grained parallelism at the module level. We provide details on the parallelism extraction achieved by our architecture and show experimental results that portray the efficiency of our face detection implementation. For the implementation and evaluation of our architecture we used the Xilinx FX130T Virtex5 FPGA device on the ML510 development board. Our performance evaluations indicate that a speedup of around 100X can be achieved over an SSE-optimized software implementation running on a 2.4GHz Core-2 Quad CPU. The detection speed reaches 625 frames per second (fps).
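The sketch below captures the generic cascade idea the detector relies on: cheap early stages reject most candidate windows, so only a few ever reach the expensive later classifiers. The stage internals (ANN classifiers over AdaBoost-selected Haar features) are abstracted as callables, and all names and thresholds are illustrative assumptions.

```python
from typing import Callable, List, Tuple

Stage = Tuple[Callable[[dict], float], float]    # (classifier score function, accept threshold)

def cascade_detect(window: dict, stages: List[Stage]) -> bool:
    """Return True only if every stage accepts; reject at the first failing stage."""
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False          # early exit: most windows stop at the cheap stages
    return True

# Illustrative stages: in the paper these would be ANN classifiers over Haar features.
stages = [
    (lambda w: w["mean_contrast"], 0.2),     # very cheap filter
    (lambda w: w["eye_region_score"], 0.5),  # more selective
    (lambda w: w["full_score"], 0.8),        # most expensive, rarely reached
]
window = {"mean_contrast": 0.4, "eye_region_score": 0.6, "full_score": 0.9}
print(cascade_detect(window, stages))        # True for this toy window
```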
Citations: 56
The impact of liquid cooling on 3D multi-core processors
Pub Date: 2009-10-04 DOI: 10.1109/ICCD.2009.5413115
H. Jang, I. Yoon, C. Kim, Seungwon Shin, S. Chung
Recently, 3D integration has been regarded as one of the most promising techniques due to its ability to reduce global wire lengths and lower power consumption. However, 3D integrated processors inevitably cause higher power density and lower thermal conductivity, since the closer proximity of heat-generating dies makes existing thermal hotspots more severe. Without an efficient cooling method inside the package, 3D integrated processors will suffer severe performance degradation from dynamic thermal management, as well as reliability problems. In this paper, we analyze the impact of liquid cooling on a 3D multi-core processor compared to conventional air cooling. We also evaluate the leakage power consumption and the lifetime reliability as functions of the temperature of each functional unit in the 3D multi-core processor. The simulation results show that liquid cooling reduces the temperature of the L1 instruction cache (the hottest block in this evaluation) by as much as 45 degrees, resulting in 12.8% leakage reduction, on average, compared to conventional air cooling. Moreover, the reduced temperature of the L1 instruction cache also significantly improves reliability with respect to electromigration, stress migration, time-dependent dielectric breakdown, thermal cycling, and negative bias temperature instability.
Citations: 34
3D GPU architecture using cache stacking: Performance, cost, power and thermal analysis
Pub Date: 2009-10-04 DOI: 10.1109/ICCD.2009.5413147
Ahmed Al-Maashri, Guangyu Sun, Xiangyu Dong, V. Narayanan, Yuan Xie
Graphics Processing Units (GPUs) offer tremendous computational and processing power. The architecture requires high communication bandwidth and low latency between computation units and caches. 3D die-stacking technology is a promising approach to meet such requirements. To the best of our knowledge, no other study has investigated the implementation of 3D technology in GPUs. In this paper, we study the impact of stacking caches using 3D technology on GPU performance. We also investigate the benefits of using 3D stacked MRAM on GPUs. Our work includes cost, power, and thermal analysis of the proposed architectural designs. Our results show a 53% geometric mean performance speedup for iso-cycle time architectures and about 19% for iso-cost architectures.
Citations: 43
VariPipe: Low-overhead variable-clock synchronous pipelines
Pub Date: 2009-10-04 DOI: 10.1109/ICCD.2009.5413167
Navid Toosizadeh, S. Zaky, Jianwen Zhu
Synchronous pipelines usually have a fixed clock frequency determined by the worst-case process-voltage-temperature (PVT) analysis of the most critical path. Higher operating frequencies are possible under typical PVT conditions, especially when the most critical path is not triggered. This paper introduces a design methodology that uses asynchronous design to generate the clock of a synchronous pipeline. The result is a variable clock period that changes cycle-by-cycle according to the current operations in the pipeline and the current PVT conditions. The paper also presents a simple design flow to implement variable-clock systems with standard cells using conventional synchronous design tools. The variable-clock pipeline technique has been tested on a 32-bit microprocessor in 90nm technology. Post-layout simulations with three sets of benchmarks demonstrate that the variable-clock processor has a two-fold performance advantage over its fixed-clock counterpart. The overhead of the added clock generation circuit is merely 2.6% in area and 3% in energy consumption, compared to an earlier proposal that incurs 100% overhead.
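A behavioral sketch of the variable-clock idea, assuming each operation class has a known worst-case stage delay at the current PVT point: the period of each cycle stretches only as far as the slowest operation actually in flight, instead of the global worst case. The delay values and five-stage model are made up for illustration.

```python
# Assumed per-operation worst-case stage delays (ns) at the current PVT corner.
STAGE_DELAY = {"add": 0.6, "load": 0.8, "div": 1.5}
STAGES = 5
FIXED_PERIOD = max(STAGE_DELAY.values())             # traditional worst-case clock period

def run_time(trace):
    """Total time when each cycle's period stretches to the slowest op in flight."""
    pipeline = [None] * STAGES
    total = 0.0
    for op in trace + [None] * (STAGES - 1):          # trailing Nones drain the pipeline
        pipeline = [op] + pipeline[:-1]               # advance everything one stage
        active = [o for o in pipeline if o is not None]
        if active:
            total += max(STAGE_DELAY[o] for o in active)
    return total

trace = ["add"] * 80 + ["load"] * 15 + ["div"] * 5
print(f"variable clock:        {run_time(trace):.1f} ns")
print(f"fixed worst-case clock: {(len(trace) + STAGES - 1) * FIXED_PERIOD:.1f} ns")
```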
Citations: 8
On-chip bidirectional wiring for heavily pipelined systems using network coding
Pub Date: 2009-10-04 DOI: 10.1109/ICCD.2009.5413165
Kalyana C. Bollapalli, Rajesh Garg, Kanupriya Gulati, S. Khatri
In this paper, we describe a low-area, reduced-power on-chip point-to-point bidirectional communication scheme for heavily pipelined systems. When data needs to be transmitted bidirectionally between two on-chip locations, the traditional approach resorts to either using two unidirectional wires, or to using a single wire (with a unidirectional transfer at any given time instant). In contrast, our bidirectional communication scheme allows data to be transmitted simultaneously between two on-chip locations, with a single wire performing the bidirectional data transfer. Our approach borrows ideas from the emerging area of network coding (in the field of communication). By utilizing coding units (which also serve the purpose of buffering the signals) along the wire between the two endpoints, we are able to achieve the same throughput as a traditional approach, while reducing the total area utilization by about 49.8% (thereby reducing routing congestion), and the total power consumption by about 11.5%. The area and power results include the contribution of routing wires, coding units, drivers, the clock distribution network and the required reset wire. Our bidirectional communication approach is ideally suited for heavily pipelined data intensive systems.
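The sketch below shows the textbook network-coding trick the scheme builds on: a coding point can carry the XOR of the two opposing data streams on one shared segment, and each endpoint recovers the other side's bits by XOR-ing the coded stream with the value it transmitted itself. This is intuition only; the paper's on-chip coding units, buffering, and timing are more involved.

```python
def bidirectional_exchange(a_bits, b_bits):
    """Exchange two equal-length bit streams over one shared coded stream."""
    assert len(a_bits) == len(b_bits)
    coded = [a ^ b for a, b in zip(a_bits, b_bits)]        # shared segment carries A xor B
    a_received = [c ^ a for c, a in zip(coded, a_bits)]    # endpoint A recovers B's bits
    b_received = [c ^ b for c, b in zip(coded, b_bits)]    # endpoint B recovers A's bits
    return a_received, b_received

a_bits = [1, 0, 1, 1, 0, 0, 1, 0]     # data A sends toward B
b_bits = [0, 0, 1, 0, 1, 1, 1, 0]     # data B sends toward A
got_at_a, got_at_b = bidirectional_exchange(a_bits, b_bits)
assert got_at_a == b_bits and got_at_b == a_bits
print("both directions recovered from one coded stream")
```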
Citations: 3