
Latest publications: 10th International Symposium on High Performance Computer Architecture (HPCA'04)

Perceptron-Based Branch Confidence Estimation
Haitham Akkary, Srikanth T. Srinivasan, Rajendar Koltur, Yogesh Patil, Wael Refaai
Pipeline gating has been proposed for reducing wasted speculative execution due to branch mispredictions. As processors become deeper or wider, pipeline gating becomes more important because the amount of wasted speculative execution increases. The quality of pipeline gating relies heavily on the branch confidence estimator used. Not much work has been done on branch confidence estimators since the initial work [6]. We show the accuracy and coverage characteristics of the initial proposals do not sufficiently reduce mis-speculative execution on future deep pipeline processors. In this paper, we present a new, perceptron-based, branch confidence estimator, which is twice as accurate as the current best-known method and achieves reasonable mispredicted branch coverage. Further, the output of our predictor is multi-valued, which enables us to classify branches further as "strongly low confident" and "weakly low confident". We reverse the predictions of "strongly low confident" branches and apply pipeline gating to the "weakly low confident" branches. This combination of pipeline gating and branch reversal provides a spectrum of interesting design options ranging from significantly reducing total execution for only a small performance loss, to lower but still significant reductions in total execution, without any performance loss.
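As a rough illustration of the idea in this abstract, the sketch below implements a perceptron that maps the global branch-history register to a multi-valued confidence estimate; the table size, history length, thresholds, and the three-way classification are illustrative assumptions, not the authors' parameters.

```python
class PerceptronConfidence:
    """Hypothetical perceptron-based confidence estimator (illustrative sizes)."""

    def __init__(self, history_len=16, table_size=1024, theta_weak=0, theta_strong=16):
        self.history_len = history_len
        self.table_size = table_size
        self.theta_weak = theta_weak      # below this: strongly low confident
        self.theta_strong = theta_strong  # at or above this: high confidence
        # One bias weight plus one weight per global-history bit, per table entry.
        self.weights = [[0] * (history_len + 1) for _ in range(table_size)]

    def _row(self, pc):
        return self.weights[pc % self.table_size]

    def output(self, pc, history):
        """history: +1/-1 outcomes of the most recent branches (length history_len)."""
        w = self._row(pc)
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], history))

    def classify(self, pc, history):
        """Multi-valued output: gate the pipeline on 'weak-low', reverse the
        branch prediction on 'strong-low'."""
        y = self.output(pc, history)
        if y >= self.theta_strong:
            return "high"
        return "weak-low" if y >= self.theta_weak else "strong-low"

    def train(self, pc, history, prediction_was_correct):
        """Perceptron update toward +1 if the branch predictor was right, -1 if not."""
        t = 1 if prediction_was_correct else -1
        y = self.output(pc, history)
        # Update on disagreement, or while the output magnitude is below threshold.
        if (y >= 0) != prediction_was_correct or abs(y) <= self.theta_strong:
            w = self._row(pc)
            w[0] += t
            for i, hi in enumerate(history):
                w[i + 1] += t * hi
```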
Citations: 26
Organizing the last line of defense before hitting the memory wall for CMPs
Chun Liu, A. Sivasubramaniam, M. Kandemir
The last line of defense in the cache hierarchy before going to off-chip memory is very critical in chip multiprocessors (CMPs) from both the performance and power perspectives. We investigate different organizations for this last line of defense (assumed to be L2 in this article) towards reducing off-chip memory accesses. We evaluate the trade-offs between private L2 and address-interleaved shared L2 designs, noting their individual benefits and drawbacks. The possible imbalance between the L2 demands across the CPUs favors a shared L2 organization, while the interference between these demands can favor a private L2 organization. We propose a new architecture, called Shared Processor-Based Split L2, that captures the benefits of these two organizations, while avoiding many of their drawbacks. Using several applications from the SPEC OMP suite and a commercial benchmark, Specjbb, on a complete system simulator, we demonstrate the benefits of this shared processor-based L2 organization. Our results show as much as 42.50% improvement in IPC over the private organization (with 11.52% on the average), and as much as 42.22% improvement over the shared interleaved organization (with 9.76% on the average).
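A toy sketch of the two organizations being contrasted may help: address interleaving spreads every CPU's blocks over all L2 slices, while a processor-based split assigns slices to CPUs and can shift them toward the CPU with higher demand. The slice count, block size, and rebalancing rule below are illustrative assumptions, not the paper's design.

```python
SLICES = 8        # illustrative number of L2 slices
BLOCK_BITS = 6    # 64-byte blocks

def interleaved_slice(addr):
    """Shared, address-interleaved L2: the slice depends only on the block address."""
    return (addr >> BLOCK_BITS) % SLICES

class ProcessorSplitL2:
    """Simplified processor-based split: each CPU owns a resizable set of slices."""

    def __init__(self, num_cpus):
        per_cpu = SLICES // num_cpus
        self.alloc = {cpu: list(range(cpu * per_cpu, (cpu + 1) * per_cpu))
                      for cpu in range(num_cpus)}

    def slice_for(self, cpu, addr):
        """A CPU's blocks spread only over the slices it currently owns."""
        owned = self.alloc[cpu]
        return owned[(addr >> BLOCK_BITS) % len(owned)]

    def rebalance(self, donor, receiver):
        """Hand one slice from a low-demand CPU to a high-demand one; a real policy
        would drive this from measured per-CPU L2 miss counts."""
        if len(self.alloc[donor]) > 1:
            self.alloc[receiver].append(self.alloc[donor].pop())

# Example: CPU 1's working set grows, so it borrows a slice from CPU 0.
l2 = ProcessorSplitL2(num_cpus=2)
l2.rebalance(donor=0, receiver=1)
print(interleaved_slice(0x1FC0), l2.slice_for(1, 0x1FC0))  # -> 7 6
```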
Citations: 183
Improving Disk Throughput in Data-Intensive Servers
E. V. Carrera, R. Bianchini
Low disk throughput is one of the main impediments to improving the performance of data-intensive servers. In this paper, we propose two management techniques for the disk controller cache that can significantly increase disk throughput. The first technique, called File-Oriented Read-ahead (FOR), adjusts the number of read-ahead blocks brought into the disk controller cache according to file system information. The second technique, called Host-guided Device Caching (HDC), gives the host control over part of the disk controller cache. As an example use of this mechanism, we keep the blocks that cause the most misses in the host buffer cache permanently cached in the disk controller. Our detailed simulations of real server workloads show that FOR and HDC can increase disk throughput by up to 34% and 24%, respectively, in comparison to conventional disk controller cache management techniques. When combined, the techniques can increase throughput by up to 47%.
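The core of FOR is to size the controller's read-ahead window using file-system knowledge that ordinary block-level controllers lack. The sketch below is one plausible reading of that idea, with hypothetical window sizes and growth rule: grow the window while accesses to a file stay sequential, but never read ahead past the end of the file.

```python
MAX_READAHEAD = 32   # blocks (illustrative)
MIN_READAHEAD = 0

def readahead_blocks(request_block, file_last_block, sequential_streak):
    """Return how many blocks to pull into the controller cache after this read."""
    # Grow the window with the observed sequential streak for this file...
    window = min(MAX_READAHEAD, 2 ** min(sequential_streak, 5))
    # ...but never fetch blocks that belong to the next, unrelated file on disk.
    remaining_in_file = max(0, file_last_block - request_block)
    return max(MIN_READAHEAD, min(window, remaining_in_file))

# Example: a read near the end of a file gets its read-ahead clipped to 3 blocks.
print(readahead_blocks(request_block=1000, file_last_block=1003, sequential_streak=6))  # -> 3
```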
Citations: 22
A low-complexity, high-performance fetch unit for simultaneous multithreading processors
Ayose Falcón, Alex Ramírez, M. Valero
Simultaneous multithreading (SMT) is an architectural technique that allows for the parallel execution of several threads simultaneously. Fetch performance has been identified as the most important bottleneck for SMT processors. The commonly adopted solution has been fetching from more than one thread each cycle. Recent studies have proposed a plethora of fetch policies to deal with fetch priority among threads, trying to increase fetch performance. We demonstrate that the simultaneous sharing of the fetch unit, apart from increasing the complexity of the fetch unit, can be counterproductive in terms of performance. We evaluate the use of high-performance fetch units in the context of SMT. Our new fetch architecture proposal allows us to feed an 8-way processor fetching from a single thread each cycle, reducing complexity, and increasing the usefulness of proposed fetch policies. Our results show that using new high-performance fetch units, like the FTB or the stream fetch, provides higher performance than fetching from two threads using common SMT fetch architectures. Furthermore, our results show that our design obtains better average performance for any kind of workloads (both ILP and memory bounded benchmarks), in contrast to previously proposed solutions.
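To make the "one thread per cycle" argument concrete, here is a sketch of a fetch cycle that gives the full width of a high-performance fetch unit to a single thread chosen by an ICOUNT-style priority; the width and the selection policy are standard illustrations, not necessarily the exact policy evaluated in the paper.

```python
FETCH_WIDTH = 8  # illustrative fetch bandwidth per cycle

def select_thread(in_flight_counts, stalled):
    """in_flight_counts[t]: instructions of thread t currently in the pipeline;
    stalled[t]: True if thread t cannot fetch (I-cache miss, full queues)."""
    candidates = [t for t in range(len(in_flight_counts)) if not stalled[t]]
    if not candidates:
        return None
    # ICOUNT-style priority: the thread occupying the least pipeline resources.
    return min(candidates, key=lambda t: in_flight_counts[t])

def fetch_cycle(in_flight_counts, stalled, fetch_unit):
    t = select_thread(in_flight_counts, stalled)
    if t is None:
        return []
    # The selected thread gets the full width of the (stream/FTB-style) fetch unit.
    return fetch_unit(t, FETCH_WIDTH)

# Example with a stub fetch unit: thread 1 has the fewest in-flight instructions.
insts = fetch_cycle([12, 5, 9, 20], [False, False, True, False],
                    lambda t, w: [f"T{t}-inst{i}" for i in range(w)])
print(insts[0])  # -> T1-inst0
```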
Citations: 15
Hardware Support for Prescient Instruction Prefetch
Tor M. Aamodt, P. Chow, Per Hammarlund, Hong Wang, John Paul Shen
This paper proposes and evaluates hardware mechanisms for supporting prescient instruction prefetch — an approach to improving single-threaded application performance by using helper threads to perform instruction prefetch. We demonstrate the need for enabling store-to-load communication and selective instruction execution when directly pre-executing future regions of an application that suffer I-cache misses. Two novel hardware mechanisms, safe-store and YAT-bits, are introduced that help satisfy these requirements. This paper also proposes and evaluates finite state machine recall, a technique for limiting pre-execution to branches that are hard to predict by leveraging a counted I-prefetch mechanism. On a research Itanium® SMT processor with next line and streaming I-prefetch mechanisms that incurs latencies representative of next generation processors, prescient instruction prefetch can improve performance by an average of 10.0% to 22% on a set of SPEC 2000 benchmarks that suffer significant I-cache misses. Prescient instruction prefetch is found to be competitive against even the most aggressive research hardware instruction prefetch technique: fetch directed instruction prefetch.
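The store-to-load requirement can be pictured with a small functional model: the helper thread's speculative stores go into a private buffer rather than architectural memory, and its later loads check that buffer first. This is only an illustration of the semantics, not the hardware safe-store mechanism itself.

```python
class SafeStoreBuffer:
    """Functional model of store-to-load forwarding for a speculative helper thread."""

    def __init__(self, memory):
        self.memory = memory          # architectural memory (never written here)
        self.speculative = {}         # addr -> value written by the helper thread

    def store(self, addr, value):
        # Buffered only; the main thread's architectural state is unaffected.
        self.speculative[addr] = value

    def load(self, addr):
        # Forward from the youngest speculative store to this address, if any.
        return self.speculative.get(addr, self.memory.get(addr, 0))

    def discard(self):
        # When the helper thread finishes or is squashed, its stores vanish.
        self.speculative.clear()

# Example: the helper thread's store is visible only to its own later loads.
mem = {0x100: 7}
buf = SafeStoreBuffer(mem)
buf.store(0x100, 42)
assert buf.load(0x100) == 42 and mem[0x100] == 7
```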
Citations: 28
Reducing Energy Consumption of Disk Storage Using Power-Aware Cache Management
Qingbo Zhu, Francis M. David, Christo Frank Devaraj, Zhenmin Li, Yuanyuan Zhou, P. Cao
Reducing energy consumption is an important issue for data centers. Among the various components of a data center, storage is one of the biggest consumers of energy. Previous studies have shown that the average idle period for a server disk in a data center is very small compared to the time taken to spin down and spin up. This significantly limits the effectiveness of disk power management schemes. This paper proposes several power-aware storage cache management algorithms that provide more opportunities for the underlying disk power management schemes to save energy. More specifically, we present an off-line power-aware greedy algorithm that is more energy-efficient than Belady’s off-line algorithm (which minimizes cache misses only). We also propose an online power-aware cache replacement algorithm. Our trace-driven simulations show that, compared to LRU, our algorithm saves 16% more disk energy and provides 50% better average response time for OLTP I/O workloads. We have also investigated the effects of four storage cache write policies on disk energy consumption.
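The underlying intuition can be sketched as a victim-selection rule; this is a simplification of the paper's off-line greedy and online algorithms, with hypothetical inputs: prefer evicting blocks whose disk is already spinning, so that disks currently idle or spun down are not woken by the resulting misses.

```python
def choose_victim(lru_order, disk_is_active):
    """lru_order: cached blocks as (disk_id, block_id), least recently used first.
    disk_is_active[d]: True if disk d is currently spinning."""
    # Prefer the least-recently-used block that lives on an already-active disk,
    # so a later miss on it will not wake a sleeping disk.
    for blk in lru_order:
        if disk_is_active[blk[0]]:
            return blk
    # If every block belongs to a sleeping disk, fall back to plain LRU.
    return lru_order[0]

# Example: block (1, 7) on the spun-down disk 1 is protected; (0, 3) is evicted.
assert choose_victim([(1, 7), (0, 3)], {0: True, 1: False}) == (0, 3)
```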
Citations: 242
Data Cache Prefetching Using a Global History Buffer
Kyle J. Nesbit, James E. Smith
A new structure for implementing data cache prefetching is proposed and analyzed via simulation. The structure is based on a Global History Buffer that holds the most recent miss addresses in FIFO order. Linked lists within this global history buffer connect addresses that have some common property, e.g. they were all generated by the same load instruction. The Global History Buffer can be used for implementing a number of previously proposed prefetch methods, as well as new ones. Prefetching with the Global History Buffer has two significant advantages over conventional table prefetching methods. First, the use of a FIFO history buffer can improve the accuracy of correlation prefetching by eliminating stale data from the table. Second, the Global History Buffer contains a more complete (and intact) picture of cache miss history, creating opportunities to design more effective prefetching methods. Global History Buffer prefetching can increase correlation prefetching performance by 20% and cut its memory traffic by 90%. Furthermore, the Global History Buffer can make correlations within a load’s address stream, which can increase stride prefetching performance by 6%. Collectively, the Global History Buffer prefetching methods perform as well or better than the conventional prefetching methods studied on 14 of 15 benchmarks.
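The structure is easiest to see in miniature. The sketch below keeps a FIFO of (load PC, miss address) pairs and, on each miss, follows that PC's history to look for a constant stride, in the spirit of the PC-localized stride variant of GHB prefetching; a real design links entries with pointers and an index table rather than scanning, and the sizes and prefetch degree here are illustrative.

```python
from collections import deque

class GlobalHistoryBuffer:
    """Toy GHB: FIFO of recent misses, with PC-localized stride detection."""

    def __init__(self, ghb_size=256, degree=2):
        self.ghb = deque(maxlen=ghb_size)   # FIFO of (load_pc, miss_addr)
        self.degree = degree                # prefetches issued per detected stride

    def on_miss(self, load_pc, miss_addr):
        """Record the miss, then examine this PC's previous misses for a stride."""
        self.ghb.append((load_pc, miss_addr))
        history = [addr for pc, addr in self.ghb if pc == load_pc]
        if len(history) < 3:
            return []
        a3, a2, a1 = history[-3], history[-2], history[-1]
        stride = a1 - a2
        if stride != 0 and stride == a2 - a3:
            # Two equal deltas: issue 'degree' prefetches down the stride.
            return [a1 + stride * i for i in range(1, self.degree + 1)]
        return []

# Example: the third miss of a striding load triggers prefetches for the next lines.
ghb = GlobalHistoryBuffer()
ghb.on_miss(0x400, 0x1000)
ghb.on_miss(0x400, 0x1040)
print([hex(a) for a in ghb.on_miss(0x400, 0x1080)])  # -> ['0x10c0', '0x1100']
```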
Citations: 375
Reducing branch misprediction penalty via selective branch recovery
A. Gandhi, Haitham Akkary, Srikanth T. Srinivasan
Branch misprediction penalty consists of two components: the time wasted on misspeculative execution until the mispredicted branch is resolved and the time to restart the pipeline with useful instructions once the branch is resolved. Current processor trends, large instruction windows and deep pipelines, amplify both components of the branch misprediction penalty. We propose a novel method, called selective branch recovery (SBR), to reduce both components of branch misprediction penalty. SBR exploits a frequently occurring type of control independence - exact convergence - where the mispredicted path converges exactly at the beginning of the correct path. In such cases, SBR selectively reuses the results computed during misspeculative execution and obviates the need to fetch or rename convergent instructions again. Thus, SBR addresses both components of branch misprediction penalty. To increase the likelihood of branch mispredictions that can be handled with SBR, we also present an effective means for inducing exact convergence on misspeculative paths. With SBR, we significantly improve performance (between 3%-22%, average 8%) on a wide range of benchmarks over our baseline processor that does not exploit SBR.
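The reuse condition can be sketched in a few lines, using register dependences only and no memory, so this is a deliberately simplified model rather than the hardware mechanism: once the wrong path is known to converge exactly at the correct-path target, instructions past that point are kept unless any of their sources were produced, directly or transitively, by the squashed control-dependent region.

```python
def reusable_results(wrong_path_insts, convergence_pc):
    """wrong_path_insts: instructions executed down the mispredicted path, in order,
    as dicts {"pc", "srcs", "dest", "value"}; convergence_pc: correct-path target."""
    poisoned = set()          # registers whose wrong-path values must be discarded
    reusable = []
    past_convergence = False
    for inst in wrong_path_insts:
        if inst["pc"] == convergence_pc:
            past_convergence = True
        if not past_convergence:
            poisoned.add(inst["dest"])              # control-dependent: squash
        elif any(s in poisoned for s in inst["srcs"]):
            poisoned.add(inst["dest"])              # tainted by the squashed region
        else:
            reusable.append(inst)                   # control- and data-independent
    return reusable

# Example: r3 depends on the squashed r1, so only the r4 result is reused.
path = [{"pc": 0x10, "srcs": ["r0"], "dest": "r1", "value": 5},   # before convergence
        {"pc": 0x40, "srcs": ["r1"], "dest": "r3", "value": 6},   # convergence point
        {"pc": 0x44, "srcs": ["r2"], "dest": "r4", "value": 7}]
assert [i["dest"] for i in reusable_results(path, 0x40)] == ["r4"]
```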
Citations: 41
Exploiting prediction to reduce power on buses
V. Wen, M. Whitney, Yatish Patel, J. Kubiatowicz
We investigate coding techniques to reduce the energy consumed by on-chip buses in a microprocessor. We explore several simple coding schemes and simulate them using a modified SimpleScalar simulator and SPEC benchmarks. We show an average of 35% savings in transitions on internal buses. To quantify actual power savings, we design a dictionary based encoder/decoder circuit in a 0.13 μm process, extract it as a netlist, and simulate its behavior under SPICE. Utilizing a realistic wire model with repeaters, we show that we can break even at median wire length scales of less than 11.5 mm at 0.13 μm and project a break-even point of 2.7 mm for a larger design at 0.07 μm.
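A toy model shows both the dictionary idea and how transition savings are measured; the dictionary size, index width, and the one-bit hit/miss line below are assumptions for illustration, not the circuit the authors built. When a word hits in a small table of recently transmitted values, only a narrow index toggles and the wide data lines keep their previous value; on a miss the full word is driven and inserted into the table.

```python
def transitions(prev, cur, width=32):
    """Count bit flips between two consecutive values on a bus of `width` wires."""
    return bin((prev ^ cur) & ((1 << width) - 1)).count("1")

class DictionaryBusEncoder:
    """Toy dictionary-style bus encoder with an assumed 8-entry value table."""

    def __init__(self, entries=8):
        self.entries = entries
        self.dictionary = []        # most recently transmitted values, newest first
        self.prev_data = 0          # value currently latched on the data lines
        self.prev_index = 0         # value currently on the narrow index lines

    def send(self, value):
        """Return the number of bus transitions caused by sending this value."""
        if value in self.dictionary:
            idx = self.dictionary.index(value)
            # Hit: only the 3-bit index lines and the hit/miss line can toggle.
            cost = transitions(self.prev_index, idx, width=3) + 1
            self.prev_index = idx
        else:
            # Miss: the full 32-bit word is driven and cached for later hits.
            cost = transitions(self.prev_data, value) + 1
            self.prev_data = value
            self.dictionary.insert(0, value)
            del self.dictionary[self.entries:]
        return cost

# Example: resending a recently seen value costs only a few index-line transitions.
enc = DictionaryBusEncoder()
print(enc.send(0xDEADBEEF), enc.send(0x0), enc.send(0xDEADBEEF))
```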
Citations: 13
Out-of-order commit processors
A. Cristal, Daniel Ortega, J. Llosa, M. Valero
Modern out-of-order processors tolerate long latency memory operations by supporting a large number of in-flight instructions. This is particularly useful in numerical applications where branch speculation is normally not a problem and where the cache hierarchy is not capable of delivering the data soon enough. In order to support more in-flight instructions, several resources have to be up-sized, such as the reorder buffer (ROB), the general purpose instructions queues, the load/store queue and the number of physical registers in the processor. However, scaling-up the number of entries in these resources is impractical because of area, cycle time, and power consumption constraints. We propose to increase the capacity of future processors by augmenting the number of in-flight instructions. Instead of simply up-sizing resources, we push for new and novel microarchitectural structures that achieve the same performance benefits but with a much lower need for resources. Our main contribution is a new checkpointing mechanism that is capable of keeping thousands of in-flight instructions at a practically constant cost. We also propose a queuing mechanism that takes advantage of the differences in waiting time of the instructions in the flow. Using these two mechanisms our processor has a performance degradation of only 10% for SPEC2000fp over a conventional processor requiring more than an order of magnitude additional entries in the ROB and instruction queues, and about a 200% improvement over a current processor with a similar number of entries.
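The checkpointing idea can be reduced to a few operations, sketched below purely for illustration (the real mechanism also covers memory state, the free list, and the choice of which branches to checkpoint): save the register map at selected branches, charge each dispatched instruction to the youngest checkpoint, release the oldest checkpoint in bulk once all of its instructions complete, and restore the saved map on a misprediction.

```python
class CheckpointProcessor:
    """Toy model of checkpoint-based (ROB-less) commit."""

    def __init__(self, num_checkpoints=8):
        self.capacity = num_checkpoints
        self.checkpoints = []        # list of [saved_register_map, pending_count]
        self.register_map = {}       # architectural reg -> physical reg

    def take_checkpoint(self):
        """Taken at selected (e.g. low-confidence) branches; stall if none are free."""
        if len(self.checkpoints) == self.capacity:
            return False
        self.checkpoints.append([dict(self.register_map), 0])
        return True

    def dispatch(self, arch_reg, phys_reg):
        """Rename a destination register and charge the instruction to the
        youngest checkpoint (one is opened automatically here for simplicity)."""
        if not self.checkpoints:
            self.take_checkpoint()
        self.register_map[arch_reg] = phys_reg
        self.checkpoints[-1][1] += 1

    def complete(self):
        """An instruction in the oldest checkpoint finishes; when its count hits
        zero the whole group commits at once and the checkpoint is freed."""
        self.checkpoints[0][1] -= 1
        if self.checkpoints[0][1] == 0:
            self.checkpoints.pop(0)

    def rollback(self, checkpoint_idx):
        """Branch misprediction: restore the saved map, discard younger checkpoints."""
        self.register_map = dict(self.checkpoints[checkpoint_idx][0])
        del self.checkpoints[checkpoint_idx:]
```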
Citations: 148