R. Joseph, Zhigang Hu, M. Martonosi, "Wavelet analysis for microprocessor design: experiences with wavelet-based dI/dt characterization," in Proc. 10th International Symposium on High Performance Computer Architecture (HPCA'04), Feb. 2004. doi:10.1109/HPCA.2004.10027

As microprocessors become increasingly complex, the techniques used to analyze and predict their behavior must become increasingly rigorous. We apply wavelet analysis techniques to the problem of dI/dt estimation and control in modern microprocessors. While prior work has considered Bayesian phase analysis, Markov analysis, and other techniques to characterize hardware and software behavior, we know of no prior work using wavelets for characterizing computer systems. The dI/dt problem has become increasingly vexing in recent years because of aggressive drops in supply voltage and increasingly large relative fluctuations in CPU current dissipation. Because the dI/dt problem has a natural frequency dependence (it is worst in the mid-frequency range of roughly 50-200 MHz), it is natural to apply frequency-oriented techniques like wavelets to understand it. Our work proposes (i) an offline wavelet-based estimation technique that can accurately predict a benchmark's likelihood of causing voltage emergencies, and (ii) an online wavelet-based control technique that uses key wavelet coefficients to predict and avert impending voltage emergencies. The offline estimation technique achieves roughly 0.94% error. The online control technique reduces false positives in dI/dt prediction, allowing voltage control to occur with less than 2.5% performance overhead on the SPEC benchmark suite.

Spiros Kalogeropulos, M. Rajagopalan, V. Rao, Yonghong Song, P. Tirumalai, "Processor Aware Anticipatory Prefetching in Loops," in Proc. 10th International Symposium on High Performance Computer Architecture (HPCA'04), Feb. 2004. doi:10.1109/HPCA.2004.10029

As microprocessor speeds increase, a large fraction of the execution time is often lost to cache miss penalties. This loss can be particularly severe in processors such as the UltraSPARC-IIICu, which have in-order execution and block on cache misses. Such processors rely greatly on the compiler to reduce stalls and achieve high performance. This paper describes a compiler technique for software prefetching that is aware of the specific prefetch behaviors of the target processor. The implementation targets loops containing control flow and strided or irregular memory access patterns. A two-phase locality analysis, capable of handling complex subscript expressions, is used for enhanced identification of prefetch candidates. Prefetch instructions are scheduled with careful consideration of the prefetch behaviors of the target system. Compared to a previous implementation, our technique produced performance improvements of 9% on the geometric mean, and up to 44% on individual tests, in Sun's first UltraSPARC-IIICu-based SPEC CPU2000 submission [5], and it has been used in all later submissions to date.

Todd E. Ehrhart, Sanjay J. Patel, "Reducing the Scheduling Critical Cycle Using Wakeup Prediction," in Proc. 10th International Symposium on High Performance Computer Architecture (HPCA'04), Feb. 2004. doi:10.1109/HPCA.2004.10016

For highest performance, a modern microprocessor must be able to determine whether an instruction is ready in the same cycle in which it is to be selected for execution. This creates a cycle of logic involving wakeup and select. However, the time a static instruction spends waiting for wakeup shows little dynamic variance. This idea is used to build a machine where wakeup times are predicted, and instructions executed too early are replayed. This form of self-scheduling reduces the critical cycle by eliminating the wakeup logic at the expense of additional replays. However, replays and other pipeline effects affect the cost of misprediction. To address this, an allowance is added to the predicted wakeup time to decrease the probability of a replay. This allowance may be associated with individual instructions or with the global state, and it is dynamically adjusted by a gradient-descent minimum-searching technique. When processor load is low, prediction may be more aggressive, increasing the chance of replays but also increasing performance; the aggressiveness of the predictor is therefore dynamically adjusted using processor load as a feedback parameter.

S. Makineni, R. Iyer, "Architectural characterization of TCP/IP packet processing on the Pentium® M microprocessor," in Proc. 10th International Symposium on High Performance Computer Architecture (HPCA'04), Feb. 2004. doi:10.1109/HPCA.2004.10024

A majority of current and next-generation server applications (Web services, e-commerce, storage, etc.) employ TCP/IP as the communication protocol of choice. As a result, the performance of these applications depends heavily on efficient TCP/IP packet processing within the termination nodes. This dependency will grow even stronger as the bandwidth needs of these applications climb from 100 Mbps to 1 Gbps and to 10 Gbps in the near future. Motivated by this, we focus on the following: (a) understanding the performance behavior of the various modes of TCP/IP processing, (b) analyzing the underlying architectural characteristics of TCP/IP packet processing, and (c) quantifying the computational requirements of the TCP/IP packet-processing component within realistic workloads. We achieve these goals by performing an in-depth analysis of packet-processing performance on Intel's state-of-the-art low-power Pentium® M microprocessor running the Microsoft Windows Server 2003 operating system. Our key observations are: (i) the mode of TCP/IP operation can significantly affect the performance requirements; (ii) transmit-side processing is largely compute-intensive, whereas receive-side processing is more memory-bound; and (iii) the computational requirements for sending and receiving packets can form a substantial component (28% to 40%) of commercial server workloads. From our analysis, we also discuss architectural and stack-related improvements that can help achieve higher server network throughput and improved application performance.

S. Narayanasamy, Yuanfang Hu, S. Sair, B. Calder, "Creating converged trace schedules using string matching," in Proc. 10th International Symposium on High Performance Computer Architecture (HPCA'04), Feb. 2004. doi:10.1109/HPCA.2004.10012

We focus on generating efficient software pipelined schedules for in-order machines, which we call converged trace schedules. For a candidate loop, we form a string of trace block identifiers by hashing together addresses of aggressively scheduled instructions from multiple iterations of a loop. In this process, the loop is unrolled and scheduled until we identify a repeating pattern in the string. Instructions corresponding to this repeating pattern form the kernel for our software pipelined schedule. We evaluate this approach to create aggressive schedules by using it in dynamic hardware and software optimization systems for an in-order architecture.

Chi F. Chen, Se-Hyun Yang, B. Falsafi, Andreas Moshovos, "Accurate and complexity-effective spatial pattern prediction," in Proc. 10th International Symposium on High Performance Computer Architecture (HPCA'04), Feb. 2004. doi:10.1109/HPCA.2004.10010

Recent research suggests that there are large variations in a cache's spatial usage, both within and across programs. Unfortunately, conventional caches typically employ fixed cache line sizes to balance the exploitation of spatial and temporal locality, and to avoid prohibitive cache fill bandwidth demands. The resulting inability of conventional caches to exploit spatial variations leads to suboptimal performance and unnecessary cache power dissipation. We describe the spatial pattern predictor (SPP), a cost-effective hardware mechanism that accurately predicts reference patterns within a spatial group (i.e., a contiguous region of data in memory) at runtime. The key observation enabling an accurate, yet low-cost, SPP design is that spatial patterns correlate well with instruction addresses and data reference offsets within a cache line. We require only a small amount of predictor memory to store the predicted patterns. Simulation results for a 64-Kbyte 2-way set-associative L1 data cache with 64-byte lines show that: (1) a 256-entry tag-less direct-mapped SPP can achieve, on average, a prediction coverage of 95%, over-predicting the patterns by only 8%, (2) assuming a 70 nm process technology, the SPP helps reduce leakage energy in the base cache by 41% on average, incurring less than 1% performance degradation, and (3) prefetching spatial groups of up to 512 bytes using SPP improves execution time by 33% on average and up to a factor of two.

P. Michaud, "Exploiting the cache capacity of a single-chip multi-core processor with execution migration," in Proc. 10th International Symposium on High Performance Computer Architecture (HPCA'04), Feb. 2004. doi:10.1109/HPCA.2004.10026

We propose to modify a conventional single-chip multicore so that a sequential program can migrate from one core to another automatically during execution. The goal of execution migration is to take advantage of the overall on-chip cache capacity. We introduce the affinity algorithm, a method for distributing cache lines automatically among several caches. We show that for working sets exhibiting a property called "splittability", it is possible to trade cache misses for migrations. Our experimental results indicate that the proposed method has the potential to improve the performance of certain sequential programs without significantly degrading the performance of others.

Jian Li, José F. Martínez, Michael C. Huang, "The thrifty barrier: energy-aware synchronization in shared-memory multiprocessors," in Proc. 10th International Symposium on High Performance Computer Architecture (HPCA'04), Feb. 2004. doi:10.1109/HPCA.2004.10018

Much research has been devoted to making microprocessors energy-efficient. However, little attention has been paid to multiprocessor environments where, due to the cooperative nature of the computation, the most energy-efficient execution in each processor may not translate into the most energy-efficient overall execution. We present the thrifty barrier, a hardware-software approach to saving energy in parallel applications that exhibit barrier synchronization imbalance. Threads that arrive early to a thrifty barrier pick among existing low-power processor sleep states based on predicted barrier stall time and other factors. We leverage the coherence protocol and propose small hardware extensions to achieve timely wake-up of these dormant threads, maximizing energy savings while minimizing the impact on performance.

Lu Peng, J. Peir, K. Lai, "Signature buffer: bridging performance gap between registers and caches," in Proc. 10th International Symposium on High Performance Computer Architecture (HPCA'04), Feb. 2004. doi:10.1109/HPCA.2004.10020

Data communication between producer instructions and consumer instructions through memory incurs extra delays that degrade processor performance. We introduce a new storage medium with a novel addressing mechanism that avoids address calculations. Instead of a memory address, each load and store is assigned a signature for accessing the new storage. A signature consists of the color of the base register along with its displacement value. A unique color is assigned to a register whenever the register is updated. When two memory instructions have the same signature, they address the same memory location. This memory signature can be formed early in the processor pipeline. A small signature buffer, addressed by the memory signature, can be established to permit stores and loads to bypass the normal memory hierarchy for fast data communication. Performance evaluations based on an Alpha 21264-like pipeline using SPEC2000 integer benchmarks show that an IPC (instructions per cycle) improvement of 13-18% is possible using a small 8-entry signature buffer.

C. Gniady, Y. C. Hu, Yung-Hsiang Lu, "Program counter based techniques for dynamic power management," in Proc. 10th International Symposium on High Performance Computer Architecture (HPCA'04), Feb. 2004. doi:10.1109/HPCA.2004.10021

Reducing energy consumption has become one of the major challenges in designing future computing systems. We propose the novel idea of using program counters to predict I/O activities in the operating system. We present a complete design of the program-counter access predictor (PCAP), which dynamically learns the access patterns of applications and predicts when an I/O device can be shut down to save energy. PCAP uses path-based correlation to observe the particular sequence of program counters leading to each idle period, and predicts future occurrences of that idle period. PCAP differs from previously proposed shutdown predictors in its ability to: (1) correlate I/O operations with particular behavior of the applications and users, (2) carry prediction information across multiple executions of the applications, and (3) attain better energy savings while incurring few mispredictions.
