Adaptive cache compression for high-performance processors
Alaa R. Alameldeen, D. Wood
DOI: 10.1145/1028176.1006719

Modern processors use two or more levels of cache memories to bridge the rising disparity between processor and memory speeds. Compression can improve cache performance by increasing effective cache capacity and eliminating misses. However, decompressing cache lines also increases cache access latency, potentially degrading performance. In this paper, we develop a policy that dynamically adapts to the costs and benefits of cache compression. We propose a two-level cache hierarchy in which the L1 cache holds uncompressed data and the L2 cache dynamically selects between compressed and uncompressed storage. The L2 cache is 8-way set-associative with LRU replacement; each set can store up to eight compressed lines but has space for only four uncompressed lines. On each L2 reference, the LRU stack depth and compressed size determine whether compression eliminated (or could have eliminated) a miss or incurred an unnecessary decompression overhead. Based on this outcome, the adaptive policy updates a single global saturating counter, which predicts whether to allocate lines in compressed or uncompressed form. We evaluate adaptive cache compression using full-system simulation and a range of benchmarks. We show that compression can improve performance for memory-intensive commercial workloads by up to 17%. However, always using compression hurts performance for low-miss-rate benchmarks, degrading performance by up to 18% due to unnecessary decompression overhead. By dynamically monitoring workload behavior, the adaptive policy achieves comparable benefits from compression while never degrading performance by more than 0.4%.

A content aware integer register file organization
Rubén González, A. Cristal, Daniel Ortega, A. Veidenbaum, M. Valero
DOI: 10.1145/1028176.1006727
A register file is a critical component of a modern superscalar processor. It has a large number of entries and read/write ports in order to enable high levels of instruction parallelism. As a result, the register file's area, access time, and energy consumption increase dramatically, significantly affecting the overall superscalar processor's performance and energy consumption. This is especially true in 64-bit processors. This paper presents a new integer register file organization, which reduces the energy consumption, area, and access time of the register file with a minimal effect on overall IPC. This is accomplished by exploiting a new concept, partial value locality, defined as the occurrence of multiple live value instances that are identical in a subset of their bits. A possible implementation of the new register file is described and used to obtain the proposed optimized register file designs. Overall, an energy reduction of over 50%, an 18% decrease in area, and a 15% reduction in access time are achieved in the new register file. The energy and area savings are achieved with a 1.7% reduction in IPC for integer applications and a negligible 0.3% reduction for numerical applications, assuming the same clock frequency. A performance increase of up to 13% is possible if the clock frequency can be increased owing to the reduced register file access time. This approach enables other, very promising optimizations, three of which are outlined in the paper.
{"title":"A content aware integer register file organization","authors":"Rubén González, A. Cristal, Daniel Ortega, A. Veidenbaum, M. Valero","doi":"10.1145/1028176.1006727","DOIUrl":"https://doi.org/10.1145/1028176.1006727","url":null,"abstract":"A register file is a critical component of a modern superscalar processor. It has a large number of entries and read/write ports in order to enable high levels of instruction parallelism. As a result, the register file's area, access time, and energy consumption increase dramatically, significantly affecting the overall superscalar processor's performance and energy consumption. This is especially true in 64-bit processors. This paper presents a new integer register file organization, which reduces energy consumption, area, and access time of the register file with a minimal effect on overall IPC. This is accomplished by exploiting a new concept, partial value locality, which is defined as occurrence of multiple live value instances identical in a subset of their bits. A possible implementation of the new register file is described and shown to obtain proposed optimized register file designs. Overall, an energy reduction of over 50%, a 18% decrease in area, and a 15% reduction in the access time are achieved in the new register file. The energy and area savings are achieved with a 1.7% reduction in IPC for integer applications and a negligible 0.3% in numerical applications, assuming the same clock frequency. A performance increase of up to 13% is possible if the clock frequency can be increases due to a reduction in the register file access time. This approach enables other, very promising optimizations, three of which are outlined in the paper.","PeriodicalId":268352,"journal":{"name":"Proceedings. 31st Annual International Symposium on Computer Architecture, 2004.","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114157905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Evaluation of the Raw microprocessor: an exposed-wire-delay architecture for ILP and streams
M. Taylor, Walter Lee, Jason E. Miller, D. Wentzlaff, Ian Bratt, B. Greenwald, H. Hoffmann, Paul R. Johnson, J. Kim, James Psota, A. Saraf, N. Shnidman, V. Strumpen, M. Frank, Saman P. Amarasinghe, A. Agarwal
DOI: 10.1145/1028176.1006733
This paper evaluates the Raw microprocessor. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance in the face of increasing wire delays. Raw approaches this challenge by implementing plenty of on-chip resources, including logic, wires, and pins, in a tiled arrangement, and exposing them through a new ISA, so that software can take advantage of these resources for parallel applications. Raw supports both ILP and streams by routing operands between architecturally exposed functional units over a point-to-point scalar operand network. This network offers low latency for scalar data transport. Raw manages the effect of wire delays by exposing the interconnect and using software to orchestrate both scalar and stream data transport. We have implemented a prototype Raw microprocessor in IBM's 180 nm, 6-layer copper, CMOS 7SF standard-cell ASIC process. We have also implemented ILP and stream compilers. Our evaluation attempts to determine the extent to which Raw succeeds in meeting its goal of serving as a more versatile, general-purpose processor. Central to achieving this goal is Raw's ability to exploit all forms of parallelism, including ILP, DLP, TLP, and stream parallelism. Specifically, we evaluate the performance of Raw on a diverse set of codes including traditional sequential programs, streaming applications, server workloads, and bit-level embedded computation. Our experimental methodology makes use of a cycle-accurate simulator validated against our real hardware. Compared to a 180 nm Pentium III using commodity PC memory system components, Raw performs within a factor of 2× for sequential applications with a very low degree of ILP, about 2× to 9× better for higher levels of ILP, and 10× to 100× better when highly parallel applications are coded in a stream language or optimized by hand. The paper also proposes a new versatility metric and uses it to discuss the generality of Raw.
{"title":"Evaluation of the Raw microprocessor: an exposed-wire-delay architecture for ILP and streams","authors":"M. Taylor, Walter Lee, Jason E. Miller, D. Wentzlaff, Ian Bratt, B. Greenwald, H. Hoffmann, Paul R. Johnson, J. Kim, James Psota, A. Saraf, N. Shnidman, V. Strumpen, M. Frank, Saman P. Amarasinghe, A. Agarwal","doi":"10.1145/1028176.1006733","DOIUrl":"https://doi.org/10.1145/1028176.1006733","url":null,"abstract":"This paper evaluates the Raw microprocessor. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance in the face of increasing wire delays. Raw approaches this challenge by implementing plenty of on-chip resources - including logic, wires, and pins - in a tiled arrangement, and exposing them through a new ISA, so that the software can take advantage of these resources for parallel applications. Raw supports both ILP and streams by routing operands between architecturally-exposed functional units over a point-to-point scalar operand network. This network offers low latency for scalar data transport. Raw manages the effect of wire delays by exposing the interconnect and using software to orchestrate both scalar and stream data transport. We have implemented a prototype Raw microprocessor in IBM's 180 nm, 6-layer copper, CMOS 7SF standard-cell ASIC process. We have also implemented ILP and stream compilers. Our evaluation attempts to determine the extent to which Raw succeeds in meeting its goal of serving as a more versatile, general-purpose processor. Central to achieving this goal is Raw's ability to exploit all forms of parallelism, including ILP, DLP, TLP, and Stream parallelism. Specifically, we evaluate the performance of Raw on a diverse set of codes including traditional sequential programs, streaming applications, server workloads and bit-level embedded computation. Our experimental methodology makes use of a cycle-accurate simulator validated against our real hardware. Compared to a 180nm Pentium-III, using commodity PC memory system components, Raw performs within a factor of 2/spl times/ for sequential applications with a very low degree of ILP, about 2/spl times/ to 9/spl times/ better for higher levels of ILP, and 10/spl times/-100/spl times/ better when highly parallel applications are coded in a stream language or optimized by hand. The paper also proposes a new versatility metric and uses it to discuss the generality of Raw.","PeriodicalId":268352,"journal":{"name":"Proceedings. 31st Annual International Symposium on Computer Architecture, 2004.","volume":"164 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125170908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Field-testing IMPACT EPIC research results in Itanium 2
J. Sias, Sain-Zee Ueng, G. A. Kent, I. Steiner, E. Nystrom, Wen-mei W. Hwu
DOI: 10.1145/1028176.1006735
Explicitly Parallel Instruction Computing (EPIC) provides architectural features, including predication and explicit control speculation, intended to enhance the compiler's ability to expose instruction-level parallelism (ILP) in control-intensive programs. Aggressive structural transformations using these features, though described in the literature, have not yet been fully characterized in complete systems. Using the Intel Itanium 2 microprocessor, the SPECint2000 benchmarks, and the IMPACT Compiler for IA-64, a research compiler competitive with the best commercial compilers on the platform, we provide an in situ evaluation of code generated using aggressive, EPIC-enabled techniques in a reality-constrained microarchitecture. Our work shows an average speedup of 1.13 (up to 1.50) from these compilation techniques, relative to traditionally optimized code at the same inlining and pointer-analysis levels, and a 1.55 speedup (up to 2.30) relative to GNU GCC, a solid traditional compiler. Detailed results show that the structural compilation approach provides benefits far beyond a decrease in branch-misprediction penalties and that it both positively and negatively impacts instruction cache performance. We also demonstrate the increasing significance of runtime effects, such as data cache and TLB behavior, in determining end performance, and the interaction of these effects with control speculation.
{"title":"Field-testing IMPACT EPIC research results in Itanium 2","authors":"J. Sias, Sain-Zee Ueng, G. A. Kent, I. Steiner, E. Nystrom, Wen-mei W. Hwu","doi":"10.1145/1028176.1006735","DOIUrl":"https://doi.org/10.1145/1028176.1006735","url":null,"abstract":"Explicitly-Parallel Instruction Computing (EPIC) provides architectural features, including predication and explicit control speculation, intended to enhance the compiler's ability to expose instruction-level parallelism (ILP) in control-intensive programs. Aggressive structural transformations using these features, though described in the literature, have not yet been fully characterized in complete systems. Using the Intel Itanium 2 microprocessor, the SPECint2000 benchmarks and the IMPACT Compiler for IA-64, a research compiler competitive with the best commercial compilers on the platform, we provide an in situ evaluation of code generated using aggressive, EPIC-enabled techniques in a reality-constrained microarchitecture. Our work shows a 1.13 average speedup (up to 1.50) due to these compilation techniques, relative to traditionally-optimized code at the same inlining and pointer analysis levels, and a 1.55 speedup (up to 2.30) relative to GNU GCC, a solid traditional compiler. Detailed results show that the structural compilation approach provides benefits far beyond a decrease in branch misprediction penalties and that it both positively and negatively impacts instruction cache performance. We also demonstrate the increasing significance of runtime effects, such as data cache and TLB, in determining end performance and the interaction of these effects with control speculation.","PeriodicalId":268352,"journal":{"name":"Proceedings. 31st Annual International Symposium on Computer Architecture, 2004.","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122851368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Wire delay is not a problem for SMT (in the near future)
T. N. Vijaykumar, Zeshan A. Chishti
DOI: 10.1145/1028176.1006706

Previous papers have shown that the slow scaling of wire delays compared to logic delays will prevent superscalar performance from scaling with technology. In this paper, we show that the optimal superscalar pipeline becomes shallower with technology when wire delays are considered, tightening previous results that deeper pipelines perform only as well as shallower ones. The key reason for the lack of performance scaling is that superscalar execution does not have sufficient parallelism to hide the relatively increased wire delays. However, simultaneous multithreading (SMT) provides the much-needed parallelism. We show that an SMT processor running a multiprogrammed workload with just 4-way issue not only retains the optimal pipeline depth across technology generations, enabling at least a 43% increase in clock speed every generation, but also achieves the remainder of the expected speedup of two per generation through IPC. As wire delays become more dominant in future technologies, the number of programs needs to be scaled only modestly to maintain these scaling trends, at least until the near-future 50 nm technology. While this result ignores bandwidth constraints, using SMT to tolerate latency due to wire delays is not that simple, because SMT causes bandwidth problems. Most of the stages of a modern out-of-order-issue pipeline employ RAM and CAM structures. Wire delays in conventional, latency-optimized RAM/CAM structures prevent them from being pipelined in a scaled manner. We show that this limitation prevents scaling of SMT throughput. We use bitline scaling to allow RAM/CAM bandwidth to scale with technology. Bitline scaling enables SMT throughput to scale at the rate of two per technology generation in the near future.

A complexity-effective approach to ALU bandwidth enhancement for instruction-level temporal redundancy
A. Parashar, S. Gurumurthi, A. Sivasubramaniam
DOI: 10.1145/1028176.1006732

Previous proposals for implementing instruction-level temporal redundancy in out-of-order cores have reported performance degradations of up to 45% in certain applications, compared to an execution without any temporal redundancy. An important contributor to this problem is the insufficient number of ALUs for handling the amplified load injected into the core. At the same time, increasing the number of ALUs can increase the complexity of the issue logic, which has been pointed out to be one of the most timing-critical components of the processor. This paper proposes a novel extension of a prior idea on instruction reuse to ease ALU bandwidth requirements in a complexity-effective way, by exploiting certain interesting properties of a dual (temporally redundant) instruction stream. We present the microarchitectural extensions necessary for implementing an instruction reuse buffer (IRB) and integrating it with the issue logic of a dual-instruction-stream superscalar core, and conduct extensive evaluations to demonstrate how well it can alleviate the ALU bandwidth problem. We show that, on average, we can gain back nearly 50% of the IPC loss caused by ALU bandwidth limitations in an instruction-level temporally redundant superscalar execution, and 23% of the overall IPC loss.

Techniques to reduce the soft error rate of a high-performance microprocessor
Christopher T. Weaver, J. Emer, Shubhendu S. Mukherjee, S. Reinhardt
DOI: 10.1145/1028176.1006723
Transient faults due to neutron and alpha particle strikes pose a significant obstacle to increasing processor transistor counts in future technologies. Although fault rates of individual transistors may not rise significantly, incorporating more transistors into a device makes that device more likely to encounter a fault. Hence, maintaining processor error rates at acceptable levels will require increasing design effort. This paper proposes two simple approaches to reduce error rates and evaluates their application to a microprocessor instruction queue. The first technique reduces the time instructions sit in vulnerable storage structures by selectively squashing instructions when long delays are encountered. A fault is less likely to cause an error if the structure it affects does not contain valid instructions. We introduce a new metric, MITF (Mean Instructions To Failure), to capture the trade-off between performance and reliability introduced by this approach. The second technique addresses false detected errors. In the absence of a fault-detection mechanism, such errors would not have affected the final outcome of a program. For example, a fault affecting the result of a dynamically dead instruction would not change the final program output, but could still be flagged by the hardware as an error. To avoid signaling such false errors, we modify a pipeline's error-detection logic to mark affected instructions and data as possibly incorrect rather than immediately signaling an error. Then, we signal an error only if we determine later that the possibly incorrect value could have affected the program's output.
{"title":"Techniques to reduce the soft error rate of a high-performance microprocessor","authors":"Christopher T. Weaver, J. Emer, Shubhendu S. Mukherjee, S. Reinhardt","doi":"10.1145/1028176.1006723","DOIUrl":"https://doi.org/10.1145/1028176.1006723","url":null,"abstract":"Transient faults due to neutron and alpha particle strikes pose a significant obstacle to increasing processor transistor counts in future technologies. Although fault rates of individual transistors may not rise significantly, incorporating more transistors into a device makes that device more likely to encounter a fault. Hence, maintaining processor error rates at acceptable levels will require increasing design effort. This paper proposes two simple approaches to reduce error rates and evaluates their application to a microprocessor instruction queue. The first technique reduces the time instructions sit in vulnerable storage structures by selectively squashing instructions when long delays are encountered. A fault is less likely to cause an error if the structure it affects does not contain valid instructions. We introduce a new metric, MITF (Mean Instructions To Failure), to capture the trade-off between performance and reliability introduced by this approach. The second technique addresses false detected errors. In the absence of a fault detection mechanism, such errors would not have affected the final outcome of a program. For example, a fault affecting the result of a dynamically dead instruction would not change the final program output, but could still be flagged by the hardware as an error. To avoid signalling such false errors, we modify a pipeline's error detection logic to mark affected instructions and data as possibly incorrect rather than immediately signaling an error. Then, we signal an error only if we determine later that the possibly incorrect value could have affected the program's output.","PeriodicalId":268352,"journal":{"name":"Proceedings. 31st Annual International Symposium on Computer Architecture, 2004.","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123105797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Exploiting resonant behavior to reduce inductive noise
Michael D. Powell, T. N. Vijaykumar
DOI: 10.1145/1028176.1006726

Inductive noise in high-performance microprocessors is a reliability issue caused by variations in processor current (di/dt), which are converted to supply-voltage glitches by impedances in the power-supply network. Inductive noise has been addressed by using decoupling capacitors to maintain low impedance in the power supply over a wide range of frequencies. However, even well-designed power supplies exhibit a few peaks of high impedance at resonant frequencies caused by RLC resonant loops. Previous architectural proposals adjust current variations by controlling instruction fetch and issue, trading off performance and energy for noise reduction. However, those proposals do not consider some conceptual issues and have implementation challenges: they require a fast response, respond to variations that do not threaten the noise margins, and respond to variations only at the resonant frequency, even though the range of high impedance extends to a resonance band around the resonant frequency. While previous schemes reduce the magnitude of variations, our proposal, called resonance tuning, changes the frequency of current variations away from the resonance band to a non-resonant frequency where they are absorbed by the power supply. Because inductive noise is a resonance problem, resonance tuning reacts only to repeated variations in the resonance band, not to isolated variations. Reacting after a few repetitions allows more time for the response and avoids unnecessary responses, reducing performance and energy loss.

Microarchitecture optimizations for exploiting memory-level parallelism
Yuan Chou, Brian Fahs, S. Abraham
DOI: 10.1145/1028176.1006708

The performance of memory-bound commercial applications such as databases is limited by increasing memory latencies. In this paper, we show that exploiting memory-level parallelism (MLP) is an effective approach for improving the performance of these applications, and that the microarchitecture has a profound impact on achievable MLP. Using the epoch model of MLP, we reason about how traditional microarchitecture features such as out-of-order issue and state-of-the-art techniques such as runahead execution affect MLP. Simulation results show that a moderately aggressive out-of-order issue processor improves MLP over an in-order issue processor by 12-30%, and that aggressive handling of loads, branches, and serializing instructions is needed to attain the full benefits of large out-of-order instruction windows. The results also show that a processor's issue window and reorder buffer should be decoupled to exploit MLP more efficiently. In addition, we demonstrate that runahead execution is highly effective in enhancing MLP, potentially improving the MLP of the database workload by 82% and its overall performance by 60%. Finally, our limit study shows that there is considerable headroom for improving MLP and overall performance by implementing effective instruction prefetching, more accurate branch prediction, and better value prediction in addition to runahead execution.

Immunet: a cheap and robust fault-tolerant packet routing mechanism
Valentin Puente, J. Gregorio, F. Vallejo, R. Beivide
DOI: 10.1145/1028176.1006718
This work presents Immunet, a new and efficient mechanism for tolerating failures in interconnection networks for parallel and distributed computers. In the presence of failures, Immunet automatically reacts with a hardware reconfiguration of the surviving network resources. Immunet has four important advantages over previous fault-tolerant switching mechanisms. First, its low hardware cost minimizes the overhead the network must support in the absence of faults. Second, as long as the network remains connected, Immunet can tolerate any number of failures regardless of their spatial and temporal combinations. Third, the resulting communication infrastructure provides optimized adaptive minimal routing over the surviving topology. Fourth, the system exhibits graceful performance degradation under successive failures. Immunet reconfiguration can be totally transparent to the applications running on the parallel system, as they are affected only by the loss of data packets in flight through the failed components; the remaining packets suffer only a tolerable delay while the automatic network reconfiguration takes place. Descriptions of the hardware network architecture, together with detailed synthetic and execution-driven simulations, demonstrate the benefits of Immunet.
{"title":"Immunet: a cheap and robust fault-tolerant packet routing mechanism","authors":"Valentin Puente, J. Gregorio, F. Vallejo, R. Beivide","doi":"10.1145/1028176.1006718","DOIUrl":"https://doi.org/10.1145/1028176.1006718","url":null,"abstract":"A new and efficient mechanism to tolerate failures in interconnection networks for parallel and distributed computers, denoted as Immunet, is presented in this work. In the presence of failures, Immunet automatically reacts with a hardware reconfiguration of the surviving network resources. Immunet has four important advantages over previous fault-tolerant switching mechanisms. Its low hardware costs minimize the overhead that the network must support in absence of faults. As long as the network remains connected, Immunet can tolerate any number of failures regardless of their spatial and temporal combinations. The resulting communication infrastructure provides optimized adaptive minimal routing over the surviving topology. The system behavior under successive failures exhibits graceful performance degradation. Immunet reconfiguration can be totally transparent to the applications running on the parallel system as they will only be affected by the loss of those data packets circulating through the broken components. The rest of the packets will suffer only a tolerable delay induced by the time employed to perform the automatic network reconfiguration. Descriptions of the hardware network architecture and detailed synthetic and execution-driven simulations will demonstrate the benefits of Immunet.","PeriodicalId":268352,"journal":{"name":"Proceedings. 31st Annual International Symposium on Computer Architecture, 2004.","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124711972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}