Pipeline gating has been proposed to reduce wasted speculative execution due to branch mispredictions. As processors become deeper or wider, pipeline gating becomes more important because the amount of wasted speculative execution increases. The quality of pipeline gating relies heavily on the branch confidence estimator used, yet little work has been done on confidence estimators since the initial proposals [6]. We show that the accuracy and coverage characteristics of those initial proposals do not sufficiently reduce mis-speculative execution on future deep-pipeline processors. In this paper, we present a new, perceptron-based branch confidence estimator that is twice as accurate as the current best-known method while achieving reasonable coverage of mispredicted branches. Further, the output of our predictor is multi-valued, which lets us further classify low-confidence branches as "strongly low confident" or "weakly low confident". We reverse the predictions of strongly-low-confident branches and apply pipeline gating to weakly-low-confident branches. This combination of pipeline gating and branch reversal offers a spectrum of design options, ranging from a large reduction in total execution for only a small performance loss to a smaller but still significant reduction with no performance loss.
{"title":"Perceptron-Based Branch Confidence Estimation","authors":"Haitham Akkary, Srikanth T. Srinivasan, Rajendar Koltur, Yogesh Patil, Wael Refaai","doi":"10.1109/HPCA.2004.10002","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10002","url":null,"abstract":"Pipeline gating has been proposed for reducing wasted speculative execution due to branch mispredictions. As processors become deeper or wider, pipeline gating becomes more important because the amount of wasted speculative execution increases. The quality of pipeline gating relies heavily on the branch confidence estimator used. Not much work has been done on branch confidence estimators since the initial work [6]. We show the accuracy and coverage characteristics of the initial proposals do not sufficiently reduce mis-speculative execution on future deep pipeline processors. In this paper, we present a new, perceptron-based, branch confidence estimator, which is twice as accurate as the current best-known method and achieves reasonable mispredicted branch coverage. Further, the output of our predictor is multi-valued, which enables us to classify branches further as \"strongly low confident\" and \"weakly low confident\". We reverse the predictions of \"strongly low confident\" branches and apply pipeline gating to the \"weakly low confident\" branches. This combination of pipeline gating and branch reversal provides a spectrum of interesting design options ranging from significantly reducing total execution for only a small performance loss, to lower but still significant reductions in total execution, without any performance loss.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121616332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The last line of defense in the cache hierarchy before going off-chip to memory is critical in chip multiprocessors (CMPs) from both the performance and power perspectives. We investigate different organizations for this last line of defense (assumed to be L2 in this article) with the goal of reducing off-chip memory accesses. We evaluate the trade-offs between private L2 and address-interleaved shared L2 designs, noting their individual benefits and drawbacks. The possible imbalance between the L2 demands of the CPUs favors a shared L2 organization, while the interference between these demands can favor a private L2 organization. We propose a new architecture, called Shared Processor-Based Split L2, that captures the benefits of both organizations while avoiding many of their drawbacks. Using several applications from the SPEC OMP suite and a commercial benchmark, Specjbb, on a complete-system simulator, we demonstrate the benefits of this shared processor-based L2 organization. Our results show as much as a 42.50% improvement in IPC over the private organization (11.52% on average), and as much as a 42.22% improvement over the shared interleaved organization (9.76% on average).
{"title":"Organizing the last line of defense before hitting the memory wall for CMPs","authors":"Chun Liu, A. Sivasubramaniam, M. Kandemir","doi":"10.1109/HPCA.2004.10017","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10017","url":null,"abstract":"The last line of defense in the cache hierarchy before going to off-chip memory is very critical in chip multiprocessors (CMPs) from both the performance and power perspectives. We investigate different organizations for this last line of defense (assumed to be L2 in this article) towards reducing off-chip memory accesses. We evaluate the trade-offs between private L2 and address-interleaved shared L2 designs, noting their individual benefits and drawbacks. The possible imbalance between the L2 demands across the CPUs favors a shared L2 organization, while the interference between these demands can favor a private L2 organization. We propose a new architecture, called Shared Processor-Based Split L2, that captures the benefits of these two organizations, while avoiding many of their drawbacks. Using several applications from the SPEC OMP suite and a commercial benchmark, Specjbb, on a complete system simulator, we demonstrate the benefits of this shared processor-based L2 organization. Our results show as much as 42.50% improvement in IPC over the private organization (with 11.52% on the average), and as much as 42.22% improvement over the shared interleaved organization (with 9.76% on the average).","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132269085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Low disk throughput is one of the main impediments to improving the performance of data-intensive servers. In this paper, we propose two management techniques for the disk controller cache that can significantly increase disk throughput. The first technique, called File-Oriented Read-ahead (FOR), adjusts the number of read-ahead blocks brought into the disk controller cache according to file system information. The second technique, called Host-guided Device Caching (HDC), gives the host control over part of the disk controller cache. As an example use of this mechanism, we keep the blocks that cause the most misses in the host buffer cache permanently cached in the disk controller. Our detailed simulations of real server workloads show that FOR and HDC can increase disk throughput by up to 34% and 24%, respectively, in comparison to conventional disk controller cache management techniques. When combined, the techniques can increase throughput by up to 47%.
{"title":"Improving Disk Throughput in Data-Intensive Servers","authors":"E. V. Carrera, R. Bianchini","doi":"10.1109/HPCA.2004.10023","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10023","url":null,"abstract":"Low disk throughput is one of the main impediments to improving the performance of data-intensive servers. In this paper, we propose two management techniques for the disk controller cache that can significantly increase disk throughput. The first technique, called File-Oriented Read-ahead (FOR), adjusts the number of read-ahead blocks brought into the disk controller cache according to file system information. The second technique, called Host-guided Device Caching (HDC), gives the host control over part of the disk controller cache. As an example use of this mechanism, we keep the blocks that cause the most misses in the host buffer cache permanently cached in the disk controller. Our detailed simulations of real server workloads show that FOR and HDC can increase disk throughput by up to 34% and 24%, respectively, in comparison to conventional disk controller cache management techniques. When combined, the techniques can increase throughput by up to 47%.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126874062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simultaneous multithreading (SMT) is an architectural technique that allows several threads to execute simultaneously on a single processor. Fetch performance has been identified as the most important bottleneck for SMT processors. The commonly adopted solution has been to fetch from more than one thread each cycle, and recent studies have proposed a plethora of fetch policies that prioritize among threads in order to increase fetch performance. We demonstrate that simultaneously sharing the fetch unit, apart from increasing its complexity, can be counterproductive in terms of performance. We evaluate the use of high-performance fetch units in the context of SMT. Our new fetch architecture feeds an 8-way processor by fetching from a single thread each cycle, reducing complexity and increasing the usefulness of proposed fetch policies. Our results show that using new high-performance fetch units, like the FTB or stream fetch units, provides higher performance than fetching from two threads using common SMT fetch architectures. Furthermore, our design obtains better average performance across workloads (both ILP-intensive and memory-bound benchmarks), in contrast to previously proposed solutions.
{"title":"A low-complexity, high-performance fetch unit for simultaneous multithreading processors","authors":"Ayose Falcón, Alex Ramírez, M. Valero","doi":"10.1109/HPCA.2004.10003","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10003","url":null,"abstract":"Simultaneous multithreading (SMT) is an architectural technique that allows for the parallel execution of several threads simultaneously. Fetch performance has been identified as the most important bottleneck for SMT processors. The commonly adopted solution has been fetching from more than one thread each cycle. Recent studies have proposed a plethora of fetch policies to deal with fetch priority among threads, trying to increase fetch performance. We demonstrate that the simultaneous sharing of the fetch unit, apart from increasing the complexity of the fetch unit, can be counterproductive in terms of performance. We evaluate the use of high-performance fetch units in the context of SMT. Our new fetch architecture proposal allows us to feed an 8-way processor fetching from a single thread each cycle, reducing complexity, and increasing the usefulness of proposed fetch policies. Our results show that using new high-performance fetch units, like the FTB or the stream fetch, provides higher performance than fetching from two threads using common SMT fetch architectures. Furthermore, our results show that our design obtains better average performance for any kind of workloads (both ILP and memory bounded benchmarks), in contrast to previously proposed solutions.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129664719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper proposes and evaluates hardware mechanisms for supporting prescient instruction prefetch, an approach that improves single-threaded application performance by using helper threads to perform instruction prefetch. We demonstrate the need for enabling store-to-load communication and selective instruction execution when directly pre-executing future regions of an application that suffer I-cache misses. Two novel hardware mechanisms, safe-store and YAT-bits, are introduced to help satisfy these requirements. This paper also proposes and evaluates finite state machine recall, a technique for limiting pre-execution to branches that are hard to predict by leveraging a counted I-prefetch mechanism. On a research Itanium® SMT processor with next-line and streaming I-prefetch mechanisms, and with latencies representative of next-generation processors, prescient instruction prefetch improves performance by an average of 10% to 22% on a set of SPEC 2000 benchmarks that suffer significant I-cache misses. Prescient instruction prefetch is found to be competitive with even the most aggressive hardware instruction prefetch technique in the literature: fetch-directed instruction prefetch.
{"title":"Hardware Support for Prescient Instruction Prefetch","authors":"Tor M. Aamodt, P. Chow, Per Hammarlund, Hong Wang, John Paul Shen","doi":"10.1109/HPCA.2004.10028","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10028","url":null,"abstract":"This paper proposes and evaluates hardware mechanisms for supporting prescient instruction prefetch — an approach to improving single-threaded application performance by using helper threads to perform instruction prefetch. We demonstrate the need for enabling store-to-load communication and selective instruction execution when directly pre-executing future regions of an application that suffer I-cache misses. Two novel hardware mechanisms, safe-store and YAT-bits, are introduced that help satisfy these requirements. This paper also proposes and evaluates .nite state machine recall, a technique for limiting pre-execution to branches that are hard to predict by leveraging a counted I-prefetch mechanism. On a research Itanium®SMT processor with next line and streaming I-prefetch mechanisms that incurs latencies representative of next generation processors, prescient instruction prefetch can improve performance by an average of 10.0% to 22% on a set of SPEC 2000 benchmarks that suffer significant I-cache misses. Prescient instruction prefetch is found to be competitive against even the most aggressive research hardware instruction prefetch technique: fetch directed instruction prefetch.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115195813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reducing energy consumption is an important issue for data centers. Among the various components of a data center, storage is one of the biggest consumers of energy. Previous studies have shown that the average idle period for a server disk in a data center is very small compared to the time taken to spin down and spin up. This significantly limits the effectiveness of disk power management schemes. This paper proposes several power-aware storage cache management algorithms that provide more opportunities for the underlying disk power management schemes to save energy. More specifically, we present an off-line power-aware greedy algorithm that is more energy-efficient than Belady’s off-line algorithm (which minimizes cache misses only). We also propose an online power-aware cache replacement algorithm. Our trace-driven simulations show that, compared to LRU, our algorithm saves 16% more disk energy and provides 50% better average response time for OLTP I/O workloads. We have also investigated the effects of four storage cache write policies on disk energy consumption.
{"title":"Reducing Energy Consumption of Disk Storage Using Power-Aware Cache Management","authors":"Qingbo Zhu, Francis M. David, Christo Frank Devaraj, Zhenmin Li, Yuanyuan Zhou, P. Cao","doi":"10.1109/HPCA.2004.10022","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10022","url":null,"abstract":"Reducing energy consumption is an important issue for data centers. Among the various components of a data center, storage is one of the biggest consumers of energy. Previous studies have shown that the average idle period for a server disk in a data center is very small compared to the time taken to spin down and spin up. This significantly limits the effectiveness of disk power management schemes. This paper proposes several power-aware storage cache management algorithms that provide more opportunities for the underlying disk power management schemes to save energy. More specifically, we present an off-line power-aware greedy algorithm that is more energy-efficient than Belady’s off-line algorithm (which minimizes cache misses only). We also propose an online power-aware cache replacement algorithm. Our trace-driven simulations show that, compared to LRU, our algorithm saves 16% more disk energy and provides 50% better average response time for OLTP I/O workloads. We have also investigated the effects of four storage cache write policies on disk energy consumption.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115274180","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A new structure for implementing data cache prefetching is proposed and analyzed via simulation. The structure is based on a Global History Buffer that holds the most recent miss addresses in FIFO order. Linked lists within this global history buffer connect addresses that share some common property, e.g., they were all generated by the same load instruction. The Global History Buffer can be used to implement a number of previously proposed prefetch methods, as well as new ones. Prefetching with the Global History Buffer has two significant advantages over conventional table-based prefetching methods. First, the use of a FIFO history buffer can improve the accuracy of correlation prefetching by eliminating stale data from the table. Second, the Global History Buffer contains a more complete (and intact) picture of cache miss history, creating opportunities to design more effective prefetching methods. Global History Buffer prefetching can increase correlation prefetching performance by 20% and cut its memory traffic by 90%. Furthermore, the Global History Buffer can find correlations within a single load's address stream, which increases stride prefetching performance by 6%. Collectively, the Global History Buffer prefetching methods perform as well as or better than the conventional prefetching methods studied on 14 of 15 benchmarks.
{"title":"Data Cache Prefetching Using a Global History Buffer","authors":"Kyle J. Nesbit, James E. Smith","doi":"10.1109/MM.2005.6","DOIUrl":"https://doi.org/10.1109/MM.2005.6","url":null,"abstract":"A new structure for implementing data cache prefetching is proposed and analyzed via simulation. The structure is based on a Global History Buffer that holds the most recent miss addresses in FIFO order. Linked lists within this global history buffer connect addresses that have some common property, e.g. they were all generated by the same load instruction. The Global History Buffer can be used for implementing a number of previously proposed prefetch methods, as well as new ones. Prefetching with the Global History Buffer has two significant advantages over conventional table prefetching methods. First, the use of a FIFO history buffer can improve the accuracy of correlation prefetching by eliminating stale data from the table. Second, the Global History Buffer contains a more complete (and intact) picture of cache miss history, creating opportunities to design more effective prefetching methods. Global History Buffer prefetching can increase correlation prefetching performance by 20% and cut its memory traffic by 90%. Furthermore, the Global History Buffer can make correlations within a load’s address stream, which can increase stride prefetching performance by 6%. Collectively, the Global History Buffer prefetching methods perform as well or better than the conventional prefetching methods studied on 14 of 15 benchmarks.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"169 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122993672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Branch misprediction penalty consists of two components: the time wasted on misspeculative execution until the mispredicted branch is resolved, and the time to restart the pipeline with useful instructions once the branch is resolved. Current processor trends, namely large instruction windows and deep pipelines, amplify both components. We propose a novel method, called selective branch recovery (SBR), to reduce both components of the branch misprediction penalty. SBR exploits a frequently occurring type of control independence, exact convergence, where the mispredicted path converges exactly at the beginning of the correct path. In such cases, SBR selectively reuses the results computed during misspeculative execution and obviates the need to fetch or rename convergent instructions again; thus, SBR addresses both components of the branch misprediction penalty. To increase the fraction of branch mispredictions that can be handled with SBR, we also present an effective means of inducing exact convergence on misspeculative paths. With SBR, we significantly improve performance (between 3% and 22%, 8% on average) on a wide range of benchmarks over a baseline processor that does not exploit SBR.
{"title":"Reducing branch misprediction penalty via selective branch recovery","authors":"A. Gandhi, Haitham Akkary, Srikanth T. Srinivasan","doi":"10.1109/HPCA.2004.10004","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10004","url":null,"abstract":"Branch misprediction penalty consists of two components: the time wasted on misspeculative execution until the mispredicted branch is resolved and the time to restart the pipeline with useful instructions once the branch is resolved. Current processor trends, large instruction windows and deep pipelines, amplify both components of the branch misprediction penalty. We propose a novel method, called selective branch recovery (SBR), to reduce both components of branch misprediction penalty. SBR exploits a frequently occurring type of control independence - exact convergence - where the mispredicted path converges exactly at the beginning of the correct path. In such cases, SBR selectively reuses the results computed during misspeculative execution and obviates the need to fetch or rename convergent instructions again. Thus, SBR addresses both components of branch misprediction penalty. To increase the likelihood of branch mispredictions that can be handled with SBR, we also present an effective means for inducing exact convergence on misspeculative paths. With SBR, we significantly improve performance (between 3%-22%, average 8%) on a wide range of benchmarks over our baseline processor that does not exploit SBR.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115713084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We investigate coding techniques to reduce the energy consumed by on-chip buses in a microprocessor. We explore several simple coding schemes and simulate them using a modified SimpleScalar simulator and SPEC benchmarks, showing an average 35% reduction in transitions on internal buses. To quantify actual power savings, we design a dictionary-based encoder/decoder circuit in a 0.13 µm process, extract it as a netlist, and simulate its behavior in SPICE. Using a realistic wire model with repeaters, we show that the encoder breaks even at median wire lengths of less than 11.5 mm at 0.13 µm, and we project a break-even point of 2.7 mm for a larger design at 0.07 µm.
{"title":"Exploiting prediction to reduce power on buses","authors":"V. Wen, M. Whitney, Yatish Patel, J. Kubiatowicz","doi":"10.1109/HPCA.2004.10025","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10025","url":null,"abstract":"We investigate coding techniques to reduce the energy consumed by on-chip buses in a microprocessor. We explore several simple coding schemes and simulate them using a modified SimpleScalar simulator and SPEC benchmarks. We show an average of 35% savings in transitions on internal buses. To quantify actual power savings, we design a dictionary based encoder/decoder circuit in a 0.13 /spl mu/m process, extract it as a netlist, and simulate its behavior under SPICE. Utilizing a realistic wire model with repeaters, we show that we can break even at median wire length scales of less than 11.5 mm at 0.13 /spl mu/ and project a break-even point of 2.7 mm for a larger design at 0.07 /spl mu/.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124842708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Modern out-of-order processors tolerate long-latency memory operations by supporting a large number of in-flight instructions. This is particularly useful in numerical applications, where branch speculation is normally not a problem and where the cache hierarchy cannot deliver data soon enough. Supporting more in-flight instructions requires up-sizing several resources, such as the reorder buffer (ROB), the general-purpose instruction queues, the load/store queue, and the number of physical registers in the processor. However, scaling up the number of entries in these resources is impractical because of area, cycle time, and power consumption constraints. We propose to increase the capacity of future processors by augmenting the number of in-flight instructions. Instead of simply up-sizing resources, we introduce new microarchitectural structures that achieve the same performance benefits with far fewer resources. Our main contribution is a new checkpointing mechanism that is capable of keeping thousands of instructions in flight at a practically constant cost. We also propose a queuing mechanism that takes advantage of the differences in waiting time among the instructions in flight. Using these two mechanisms, our processor loses only 10% of performance on SPEC2000fp relative to a conventional processor that requires more than an order of magnitude more entries in the ROB and instruction queues, and it improves performance by about 200% over a current processor with a similar number of entries.
{"title":"Out-of-order commit processors","authors":"A. Cristal, Daniel Ortega, J. Llosa, M. Valero","doi":"10.1109/HPCA.2004.10008","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10008","url":null,"abstract":"Modern out-of-order processors tolerate long latency memory operations by supporting a large number of in-flight instructions. This is particularly useful in numerical applications where branch speculation is normally not a problem and where the cache hierarchy is not capable of delivering the data soon enough. In order to support more in-flight instructions, several resources have to be up-sized, such as the reorder buffer (ROB), the general purpose instructions queues, the load/store queue and the number of physical registers in the processor. However, scaling-up the number of entries in these resources is impractical because of area, cycle time, and power consumption constraints. We propose to increase the capacity of future processors by augmenting the number of in-flight instructions. Instead of simply up-sizing resources, we push for new and novel microarchitectural structures that achieve the same performance benefits but with a much lower need for resources. Our main contribution is a new checkpointing mechanism that is capable of keeping thousands of in-flight instructions at a practically constant cost. We also propose a queuing mechanism that takes advantage of the differences in waiting time of the instructions in the flow. Using these two mechanisms our processor has a performance degradation of only 10% for SPEC2000fp over a conventional processor requiring more than an order of magnitude additional entries in the ROB and instruction queues, and about a 200% improvement over a current processor with a similar number of entries.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124654753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}