
10th International Symposium on High Performance Computer Architecture (HPCA'04): Latest Publications

Wavelet analysis for microprocessor design: experiences with wavelet-based dI/dt characterization
R. Joseph, Zhigang Hu, M. Martonosi
As microprocessors become increasingly complex, the techniques used to analyze and predict their behavior must become increasingly rigorous. We apply wavelet analysis techniques to the problem of dI/dt estimation and control in modern microprocessors. While prior work has considered Bayesian phase analysis, Markov analysis, and other techniques to characterize hardware and software behavior, we know of no prior work using wavelets for characterizing computer systems. The dI/dt problem has been increasingly vexing in recent years, because of aggressive drops in supply voltage and increasingly large relative fluctuations in CPU current dissipation. Because the dI/dt problem has natural frequency dependence (it is worst in the mid-frequency range of roughly 50-200 MHz), it is natural to apply frequency-oriented techniques like wavelets to understand it. Our work proposes (i) an offline wavelet-based estimation technique that can accurately predict a benchmark's likelihood of causing voltage emergencies, and (ii) an online wavelet-based control technique that uses key wavelet coefficients to predict and avert impending voltage emergencies. The offline estimation technique works with roughly 0.94% error. The online control technique reduces false positives in dI/dt prediction, allowing voltage control to occur with less than 2.5% performance overhead on the SPEC benchmark suite.
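The abstract stops short of the estimator itself, so the following is a minimal sketch of the offline idea under stated assumptions: decompose a per-cycle current trace with a discrete wavelet transform and score the coefficient energy falling in the resonance-prone 50-200 MHz band. The wavelet choice (db4), decomposition depth, clock rate, threshold, and the function name emergency_score are all illustrative, not the paper's calibrated values.

```python
# Hedged sketch: wavelet-based screening of a current trace for
# mid-frequency (~50-200 MHz) energy that could excite voltage
# emergencies. All parameters below are assumptions for illustration.
import numpy as np
import pywt  # PyWavelets

def emergency_score(current, clock_hz=3e9, band=(50e6, 200e6), thresh=5.0):
    """Return (band RMS coefficient energy, predicted-emergency flag)."""
    n_levels = 8
    coeffs = pywt.wavedec(current, 'db4', level=n_levels)  # [cA_n, cD_n..cD_1]
    scores = []
    for i, detail in enumerate(coeffs[1:]):
        level = n_levels - i
        lo, hi = clock_hz / 2 ** (level + 1), clock_hz / 2 ** level
        if hi >= band[0] and lo <= band[1]:                # level overlaps band
            scores.append(float(np.sqrt(np.mean(detail ** 2))))
    score = max(scores) if scores else 0.0
    return score, score > thresh

# Synthetic per-cycle current trace: a 100 MHz burst riding on noise
# stands out clearly in the mid-frequency detail coefficients.
t = np.arange(1 << 14) / 3e9
trace = 20 + 5 * np.sin(2 * np.pi * 100e6 * t) + np.random.randn(t.size)
print(emergency_score(trace))
```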
{"title":"Wavelet analysis for microprocessor design: experiences with wavelet-based dI/dt characterization","authors":"R. Joseph, Zhigang Hu, M. Martonosi","doi":"10.1109/HPCA.2004.10027","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10027","url":null,"abstract":"As microprocessors become increasingly complex, the techniques used to analyze and predict their behavior must become increasingly rigorous. We apply wavelet analysis techniques to the problem of dl/dt estimation and control in modern microprocessors. While prior work has considered Bayesian phase analysis, Markov analysis, and other techniques to characterize hardware and software behavior, we know of no prior work using wavelets for characterizing computer systems. The dl/dt problem has been increasingly vexing in recent years, because of aggressive drops in supply voltage and increasingly large relative fluctuations in CPU current dissipation. Because the dl/dt problem has natural frequency dependence (it is worst in the mid-frequency range of roughly 50-200 MHz) it is natural to apply frequency-oriented techniques like wavelets to understand it. Our work proposes (i) an offline wavelet-based estimation technique that can accurately predict a benchmark's likelihood of causing voltage emergencies, and (ii) an online wavelet-based control technique that uses key wavelet coefficients to predict and avert impending voltage emergencies. The offline estimation technique works with roughly 0.94% error. The online control technique reduces false positives in dl/dt prediction, allowing, voltage control to occur with less than 2.5% performance overhead on the SPEC benchmark suite.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115091195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29
Processor Aware Anticipatory Prefetching in Loops
Spiros Kalogeropulos, M. Rajagopalan, V. Rao, Yonghong Song, P. Tirumalai
As microprocessor speeds increase, a large fraction of the execution time is often lost to cache miss penalties. This loss can be particularly severe in processors such as the UltraSPARC-IIICu which have in-order execution and block on cache misses. Such processors rely greatly on the compiler to reduce stalls and achieve high performance. This paper describes a compiler technique for software prefetching that is aware of the specific prefetch behaviors of the target processor. The implementation targets loops containing control-flow and strided or irregular memory access patterns. A two-phase locality analysis, capable of handling complex subscript expressions, is used for enhanced identification of prefetch candidates. Prefetch instructions are scheduled with careful consideration of the prefetch behaviors in the target system. Compared to a previous implementation, our technique produced performance improvements of 9% on the geometric mean, and up to 44% on individual tests, in Sun's first UltraSPARC-IIICu-based SPEC CPU2000 submission [5], and it has been used in all later submissions to date.
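The two-phase locality analysis is compiler-internal and not reproduced in the abstract; the sketch below only illustrates the scheduling consideration it feeds, under stated assumptions: confirm a constant stride in a load's address stream, then issue prefetches far enough ahead of the loop to hide the miss latency. The miss-latency and cycles-per-iteration figures, and hence the distance formula, are illustrative rather than UltraSPARC-IIICu parameters.

```python
# Hedged sketch: stride confirmation plus latency-aware prefetch distance.
def prefetch_addresses(addrs, miss_latency=200, cycles_per_iter=20):
    """Yield (trigger_addr, prefetch_addr) for a confirmed strided stream."""
    distance = max(1, miss_latency // cycles_per_iter)  # iterations ahead
    last = stride = None
    for a in addrs:
        if last is not None:
            if stride is not None and a - last == stride and stride != 0:
                yield a, a + stride * distance          # stride seen twice
            stride = a - last
        last = a

# A loop walking an array with an 8-byte stride: prefetches land
# 200/20 = 10 iterations ahead of the triggering access.
stream = [0x1000 + 8 * i for i in range(8)]
for trigger, pf in prefetch_addresses(stream):
    print(hex(trigger), "->", hex(pf))
```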
{"title":"Processor Aware Anticipatory Prefetching in Loops","authors":"Spiros Kalogeropulos, M. Rajagopalan, V. Rao, Yonghong Song, P. Tirumalai","doi":"10.1109/HPCA.2004.10029","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10029","url":null,"abstract":"As microprocessor speeds increase, a large fraction of the execution time is often lost to cache miss penalties. This loss can be particularly severe in processors such as the UltraSPARC-IIICu which have in-order execution and block on cache misses. Such processors rely greatly on the compiler to reduce stalls and achieve high performance. This paper describes a compiler technique for software prefetching that is aware of the specific prefetch behaviors of the target processor. The implementation targets loops containing control-flow and strided or irregular memory access patterns. A two phase locality analysis, capable of handling complex subscript expressions, is used for enhanced identification of prefetch candidates. Prefetch instructions are scheduled with careful consideration of the prefetch behaviors in the target system. Compared to a previous implementation, our technique produced performance improvements of 9% on the geometric mean, and up to 44% on individual tests, in Sun’s first UltraSPARC-IIICu based SPEC CPU2000 submission [5] and has been used in all later submissions to date.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122325564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Reducing the Scheduling Critical Cycle Using Wakeup Prediction
Todd E. Ehrhart, Sanjay J. Patel
For highest performance, a modern microprocessor must be able to determine if an instruction is ready in the same cycle in which it is to be selected for execution. This creates a cycle of logic involving wakeup and select. However, the time a static instruction spends waiting for wakeup shows little dynamic variance. This idea is used to build a machine where wakeup times are predicted, and instructions executed too early are replayed. This form of self-scheduling reduces the critical cycle by eliminating the wakeup logic at the expense of additional replays. However, replays and other pipeline effects affect the cost of misprediction. To solve this, an allowance is added to the predicted wakeup time to decrease the probability of a replay. This allowance may be associated with individual instructions or the global state, and is dynamically adjusted by a gradient-descent minimum-searching technique. When processor load is low, prediction may be more aggressive, raising the chance of replays but improving performance; the aggressiveness of the predictor is therefore dynamically adjusted using processor load as a feedback parameter.
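A minimal sketch of the allowance mechanism follows; the smoothing factor, cost weights, step size, and the load-factor knob are assumptions chosen to make the gradient-descent idea concrete, not the paper's tuned values.

```python
# Hedged sketch: per-static-instruction wakeup prediction with an
# allowance that replays push up and needless waiting pulls down.
class WakeupPredictor:
    def __init__(self, replay_cost=8.0, wait_cost=1.0, step=0.1):
        self.latency = {}     # static PC -> smoothed wakeup delay (cycles)
        self.allowance = {}   # static PC -> safety margin (cycles)
        self.replay_cost, self.wait_cost, self.step = replay_cost, wait_cost, step

    def predict(self, pc, load_factor=1.0):
        # Low processor load shrinks the effective allowance (more aggressive).
        return self.latency.get(pc, 1.0) + self.allowance.get(pc, 0.0) * load_factor

    def update(self, pc, actual):
        replayed = actual > self.predict(pc)   # woke up too early -> replay
        self.latency[pc] = 0.9 * self.latency.get(pc, actual) + 0.1 * actual
        # Descend the expected-cost gradient: replays are costly, waiting is cheap.
        grad = self.replay_cost if replayed else -self.wait_cost
        self.allowance[pc] = max(0.0, self.allowance.get(pc, 0.0) + self.step * grad)

wp = WakeupPredictor()
for actual in [3, 3, 9, 3]:         # one late wakeup forces a replay...
    wp.update(0x40c, actual)
print(round(wp.predict(0x40c), 2))  # ...and the allowance grows to absorb it
```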
{"title":"Reducing the Scheduling Critical Cycle Using Wakeup Prediction","authors":"Todd E. Ehrhart, Sanjay J. Patel","doi":"10.1109/HPCA.2004.10016","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10016","url":null,"abstract":"For highest performance, a modern microprocessor must be able to determine if an instruction is ready in the same cycle in which it is to be selected for execution. This creates a cycle of logic involving wakeup and select. However, the time a static instruction spends waiting for wakeup shows little dynamic variance. This idea is used to build a machine where wakeup times are predicted, and instructions executed too early are replayed. This form of self-scheduling reduces the critical cycle by eliminating the wakeup logic at the expense of additional replays. However, replays and other pipeline effects affect the cost of misprediction. To solve this, an allowance is added to the predicted wakeup time to decrease the probability of a replay. This allowance may be associated with individual instructions or the global state, and is dynamically adjusted by a gradient-descent minimum-searching technique. When processor load is low, prediction may be more aggressive — increasing the chance of replays, but increasing performance, so the aggressiveness of the predictor is dynamically adjusted using processor load as a feedback parameter.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128316964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 23
Architectural characterization of TCP/IP packet processing on the Pentium® M microprocessor
S. Makineni, R. Iyer
A majority of the current and next generation server applications (Web services, e-commerce, storage, etc.) employ TCP/IP as the communication protocol of choice. As a result, the performance of these applications is heavily dependent on efficient TCP/IP packet processing within the termination nodes. This dependency becomes even greater as the bandwidth needs of these applications grow from 100 Mbps to 1 Gbps to 10 Gbps in the near future. Motivated by this, we focus on the following: (a) to understand the performance behavior of the various modes of TCP/IP processing, (b) to analyze the underlying architectural characteristics of TCP/IP packet processing, and (c) to quantify the computational requirements of the TCP/IP packet processing component within realistic workloads. We achieve these goals by performing an in-depth analysis of packet processing performance on Intel's state-of-the-art low power Pentium® M microprocessor running the Microsoft Windows Server 2003 operating system. Some of our key observations are: (i) that the mode of TCP/IP operation can significantly affect the performance requirements, (ii) that transmit-side processing is largely compute-intensive as compared to receive-side processing, which is more memory-bound, and (iii) that the computational requirements for sending/receiving packets can form a substantial component (28% to 40%) of commercial server workloads. From our analysis, we also discuss architectural as well as stack-related improvements that can help achieve higher server network throughput and result in improved application performance.
{"title":"Architectural characterization of TCP/IP packet processing on the Pentium/spl reg/ M microprocessor","authors":"S. Makineni, R. Iyer","doi":"10.1109/HPCA.2004.10024","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10024","url":null,"abstract":"A majority of the current and next generation server applications (Web services, e-commerce, storage, etc.) employ TCP/IP as the communication protocol of choice. As a result, the performance of these applications is heavily dependent on the efficient TCP/IP packet processing within the termination nodes. This dependency becomes even greater as the bandwidth needs of these applications grow from 100 Mbps to 1 Gbps to 10 Gbps in the near future. Motivated by this, we focus on the following: (a) to understand the performance behavior of the various modes of TCP/IP processing, (b) to analyze the underlying architectural characteristics of TCP/IP packet processing and (c) to quantify the computational requirements of the TCP/IP packet processing component within realistic workloads. We achieve these goals by performing an in-depth analysis of packet processing performance on Intel's state-of-the-art low power Pentium/spl reg/ M microprocessor running the Microsoft Windows* Server 2003 operating system. Some of our key observations are - (i) that the mode of TCP/IP operation can significantly affect the performance requirements, (ii) that transmit-side processing is largely compute-intensive as compared to receive-side processing which is more memory-bound and (iii) that the computational requirements for sending/receiving packets can form a substantial component (28% to 40%) of commercial server workloads. From our analysis, we also discuss architectural as well as stack-related improvements that can help achieve higher server network throughput and result in improved application performance.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"272 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115598243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 46
Creating converged trace schedules using string matching
S. Narayanasamy, Yuanfang Hu, S. Sair, B. Calder
We focus on generating efficient software pipelined schedules for in-order machines, which we call converged trace schedules. For a candidate loop, we form a string of trace block identifiers by hashing together addresses of aggressively scheduled instructions from multiple iterations of a loop. In this process, the loop is unrolled and scheduled until we identify a repeating pattern in the string. Instructions corresponding to this repeating pattern form the kernel for our software pipelined schedule. We evaluate this approach to create aggressive schedules by using it in dynamic hardware and software optimization systems for an in-order architecture.
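As a concrete rendering of the pattern-detection step, the sketch below searches an unrolled sequence of trace-block identifiers for a suffix that repeats, which then serves as the kernel. The hashing of scheduled instruction addresses into block identifiers is elided; plain IDs stand in for it, and the minimum repeat count is an assumption.

```python
# Hedged sketch: find the repeating tail of a trace-block string.
def find_kernel(block_ids, min_repeats=2):
    """Return (start, period) of a pattern repeating at the end, else None."""
    n = len(block_ids)
    for period in range(1, n // min_repeats + 1):
        reps = 1
        # Count how many times the final `period`-long pattern repeats.
        while (reps + 1) * period <= n and \
              block_ids[n - (reps + 1) * period: n - reps * period] == \
              block_ids[n - period:]:
            reps += 1
        if reps >= min_repeats:
            return n - reps * period, period
    return None

# Unrolled iterations converge to a steady-state ('C', 'D') kernel
# after a two-block prologue.
trace = ['A', 'B', 'C', 'D', 'C', 'D', 'C', 'D']
print(find_kernel(trace))   # -> (2, 2): the kernel is ['C', 'D']
```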
{"title":"Creating converged trace schedules using string matching","authors":"S. Narayanasamy, Yuanfang Hu, S. Sair, B. Calder","doi":"10.1109/HPCA.2004.10012","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10012","url":null,"abstract":"We focus on generating efficient software pipelined schedules for in-order machines, which we call converged trace schedules. For a candidate loop, we form a string of trace block identifiers by hashing together addresses of aggressively scheduled instructions from multiple iterations of a loop. In this process, the loop is unrolled and scheduled until we identify a repeating pattern in the string. Instructions corresponding to this repeating pattern form the kernel for our software pipelined schedule. We evaluate this approach to create aggressive schedules by using it in dynamic hardware and software optimization systems for an in-order architecture.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124502683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Accurate and complexity-effective spatial pattern prediction
Chi F. Chen, Se-Hyun Yang, B. Falsafi, Andreas Moshovos
Recent research suggests that there are large variations in a cache's spatial usage, both within and across programs. Unfortunately, conventional caches typically employ fixed cache line sizes to balance the exploitation of spatial and temporal locality, and to avoid prohibitive cache fill bandwidth demands. The resulting inability of conventional caches to exploit spatial variations leads to suboptimal performance and unnecessary cache power dissipation. We describe the spatial pattern predictor (SPP), a cost-effective hardware mechanism that accurately predicts reference patterns within a spatial group (i.e., a contiguous region of data in memory) at runtime. The key observation enabling an accurate, yet low-cost, SPP design is that spatial patterns correlate well with instruction addresses and data reference offsets within a cache line. We require only a small amount of predictor memory to store the predicted patterns. Simulation results for a 64-Kbyte 2-way set-associative L1 data cache with 64-byte lines show that: (1) a 256-entry tag-less direct-mapped SPP can achieve, on average, a prediction coverage of 95%, over-predicting the patterns by only 8%, (2) assuming a 70 nm process technology, the SPP helps reduce leakage energy in the base cache by 41% on average, incurring less than 1% performance degradation, and (3) prefetching spatial groups of up to 512 bytes using SPP improves execution time by 33% on average and up to a factor of two.
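The predictor is simple enough to sketch from the abstract's description: a small table, indexed by a hash of the triggering load's PC and its offset within the cache line, records which lines of a spatial group were actually used. The table geometry, the index hash, and the default fetch-everything policy below are illustrative assumptions.

```python
# Hedged sketch of a spatial pattern predictor (SPP).
GROUP_BYTES, LINE_BYTES, BLOCKS = 512, 64, 8   # 8-line (512 B) spatial groups
TABLE_ENTRIES = 256                            # tag-less, direct-mapped

class SpatialPatternPredictor:
    def __init__(self):
        # Until trained, predict the full group (fetch all 8 lines).
        self.table = [(1 << BLOCKS) - 1] * TABLE_ENTRIES

    def _index(self, pc, addr):
        # Key observation from the paper: PC plus offset within the
        # cache line correlates well with the group's usage pattern.
        return (pc ^ (addr % LINE_BYTES)) % TABLE_ENTRIES

    def predict(self, pc, addr):
        """Bitmask over the group's lines: which ones are worth fetching."""
        return self.table[self._index(pc, addr)]

    def train(self, pc, addr, used_mask):
        """At group eviction, record which lines were actually touched."""
        self.table[self._index(pc, addr)] = used_mask

spp = SpatialPatternPredictor()
spp.train(pc=0x400, addr=0x8000, used_mask=0b00001111)  # 4 of 8 lines used
print(bin(spp.predict(pc=0x400, addr=0x9000)))          # -> 0b1111
```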
{"title":"Accurate and complexity-effective spatial pattern prediction","authors":"Chi F. Chen, Se-Hyun Yang, B. Falsafi, Andreas Moshovos","doi":"10.1109/HPCA.2004.10010","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10010","url":null,"abstract":"Recent research suggests that there are large variations in a cache's spatial usage, both within and across programs. Unfortunately, conventional caches typically employ fixed cache line sizes to balance the exploitation of spatial and temporal locality, and to avoid prohibitive cache fill bandwidth demands. The resulting inability of conventional caches to exploit spatial variations leads to suboptimal performance and unnecessary cache power dissipation. We describe the spatial pattern predictor (SPP), a cost-effective hardware mechanism that accurately predicts reference patterns within a spatial group (i.e., a contiguous region of data in memory) at runtime. The key observation enabling an accurate, yet low-cost, SPP design is that spatial patterns correlate well with instruction addresses and data reference offsets within a cache line. We require only a small amount of predictor memory to store the predicted patterns. Simulation results for a 64-Kbyte 2-way set-associative Ll data cache with 64-byte lines show that: (1) a 256-entry tag-less direct-mapped SPP can achieve, on average, a prediction coverage of 95%, over-predicting the patterns by only 8%, (2) assuming a 70 nm process technology, the SPP helps reduce leakage energy in the base cache by 41% on average, incurring less than 1% performance degradation, and (3) prefetching spatial groups of up to 512 bytes using SPP improves execution time by 33% on average and up to a factor of two.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128531051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 110
Exploiting the cache capacity of a single-chip multi-core processor with execution migration
P. Michaud
We propose to modify a conventional single-chip multicore so that a sequential program can migrate from one core to another automatically during execution. The goal of execution migration is to take advantage of the overall on-chip cache capacity. We introduce the affinity algorithm, a method for distributing cache lines automatically on several caches. We show that on working sets exhibiting a property called "splittability", it is possible to trade cache misses for migrations. Our experimental results indicate that the proposed method has a potential for improving the performance of certain sequential programs, without degrading significantly the performance of others.
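The abstract does not spell out the affinity algorithm, so the sketch below is loudly hypothetical: it only illustrates the general shape of the idea, with a working set split across two cores' caches and execution migrating toward the core that owns most of the recently touched lines. The window size, the even/odd placement, and the voting rule are inventions for illustration.

```python
# Loudly hypothetical sketch: migrate execution toward the cache that
# holds the lines the program is currently touching.
from collections import Counter, deque

def choose_core(recent_lines, placement):
    """Pick the core owning the majority of recently accessed lines."""
    votes = Counter(placement[line] for line in recent_lines)
    return votes.most_common(1)[0][0]

# A "splittable" working set distributed across two cores' caches.
placement = {line: (0 if line % 2 == 0 else 1) for line in range(16)}
window = deque([0, 2, 4, 6, 8], maxlen=5)   # recent accesses favor core 0
print(choose_core(window, placement))       # -> 0: run (or stay) on core 0
```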
{"title":"Exploiting the cache capacity of a single-chip multi-core processor with execution migration","authors":"P. Michaud","doi":"10.1109/HPCA.2004.10026","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10026","url":null,"abstract":"We propose to modify a conventional single-chip multicore so that a sequential program can migrate from one core to another automatically during execution. The goal of execution migration is to take advantage of the overall on-chip cache capacity. We introduce the affinity algorithm, a method for distributing cache lines automatically on several caches. We show that on working-sets exhibiting a property called \"splittability\", it is possible to trade cache misses for migrations. Our experimental results indicate that the proposed method has a potential for improving the performance of certain sequential programs, without degrading significantly the performance of others.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122188916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 44
The thrifty barrier: energy-aware synchronization in shared-memory multiprocessors
Jian Li, José F. Martínez, Michael C. Huang
Much research has been devoted to making microprocessors energy-efficient. However, little attention has been paid to multiprocessor environments where, due to the cooperative nature of the computation, the most energy-efficient execution in each processor may not translate into the most energy-efficient overall execution. We present the thrifty barrier, a hardware-software approach to saving energy in parallel applications that exhibit barrier synchronization imbalance. Threads that arrive early to a thrifty barrier pick among existing low-power processor sleep states based on predicted barrier stall time and other factors. We leverage the coherence protocol and propose small hardware extensions to achieve timely wake-up of these dormant threads, maximizing energy savings while minimizing the impact on performance.
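A minimal sketch of the arrival-time decision follows; the sleep-state table (transition latencies and relative power), the exponential smoothing, and the safety margin are assumptions standing in for the paper's measured values.

```python
# Hedged sketch: pick the deepest sleep state whose transition cost is
# safely amortized by this barrier's predicted stall time.
SLEEP_STATES = [            # (name, entry+exit cycles, relative power)
    ("spin", 0,     1.00),
    ("halt", 200,   0.30),
    ("deep", 20000, 0.05),
]

class ThriftyBarrier:
    def __init__(self, margin=2.0):
        self.history = {}   # barrier site -> smoothed stall time (cycles)
        self.margin = margin

    def pick_state(self, barrier_id):
        stall = self.history.get(barrier_id, 0.0)
        best = SLEEP_STATES[0]
        for name, cost, power in SLEEP_STATES:
            if cost * self.margin <= stall:   # transition cost amortized
                best = (name, cost, power)
        return best[0]

    def record(self, barrier_id, stall):
        old = self.history.get(barrier_id, stall)
        self.history[barrier_id] = 0.75 * old + 0.25 * stall

tb = ThriftyBarrier()
tb.record("loop_barrier", 50_000)     # this thread habitually arrives early
print(tb.pick_state("loop_barrier"))  # -> 'deep'
```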
{"title":"The thrifty barrier: energy-aware synchronization in shared-memory multiprocessors","authors":"Jian Li, José F. Martínez, Michael C. Huang","doi":"10.1109/HPCA.2004.10018","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10018","url":null,"abstract":"Much research has been devoted to making microprocessors energy-efficient. However, little attention has been paid to multiprocessor environments where, due to the cooperative nature of the computation, the most energy-efficient execution in each processor may not translate into the most energy-efficient overall execution. We present the thrifty barrier, a hardware-software approach to saving energy in parallel applications that exhibit barrier synchronization imbalance. Threads that arrive early to a thrifty barrier pick among existing low-power processor sleep states based on predicted barrier stall time and other factors. We leverage the coherence protocol and propose small hardware extensions to achieve timely wake-up of these dormant threads, maximizing energy savings while minimizing the impact on performance.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134341284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 126
Signature buffer: bridging performance gap between registers and caches
Lu Peng, J. Peir, K. Lai
Data communications between producer instructions and consumer instructions through memory incur extra delays that degrade processor performance. We introduce a new storage medium with a novel addressing mechanism that avoids address calculations. Instead of a memory address, each load and store is assigned a signature for accessing the new storage. A signature consists of the color of the base register along with its displacement value. A unique color is assigned to a register whenever the register is updated. When two memory instructions have the same signature, they refer to the same memory location. This memory signature can be formed early in the processor pipeline. A small signature buffer, addressed by the memory signature, can be established to permit stores and loads bypassing the normal memory hierarchy for fast data communication. Performance evaluations based on an Alpha 21264-like pipeline using SPEC2000 integer benchmarks show that an IPC (instructions per cycle) improvement of 13-18% is possible using a small 8-entry signature buffer.
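The signature mechanism is concrete enough to sketch directly from the abstract; the buffer capacity and the eviction policy below are illustrative assumptions.

```python
# Hedged sketch: register coloring plus a small signature-indexed buffer
# that forwards data without computing a memory address.
class SignatureBuffer:
    def __init__(self, entries=8):
        self.entries = entries
        self.color = {}        # register -> current color
        self.next_color = 0
        self.buf = {}          # (base color, displacement) -> value

    def write_reg(self, reg):
        self.color[reg] = self.next_color   # fresh color on every update
        self.next_color += 1

    def signature(self, base_reg, disp):
        return (self.color.setdefault(base_reg, -1), disp)

    def store(self, base_reg, disp, value):
        if len(self.buf) >= self.entries:
            self.buf.pop(next(iter(self.buf)))  # evict oldest (FIFO)
        self.buf[self.signature(base_reg, disp)] = value

    def load(self, base_reg, disp):
        # A hit means the producer store had the same signature, hence
        # the same address; a miss (None) falls back to the cache.
        return self.buf.get(self.signature(base_reg, disp))

sb = SignatureBuffer()
sb.write_reg("r1")        # r1 updated -> new color
sb.store("r1", 16, 42)    # store 42 at [r1 + 16]
print(sb.load("r1", 16))  # -> 42, forwarded with no address calculation
```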
{"title":"Signature buffer: bridging performance gap between registers and caches","authors":"Lu Peng, J. Peir, K. Lai","doi":"10.1109/HPCA.2004.10020","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10020","url":null,"abstract":"Data communications between producer instructions and consumer instructions through memory incur extra delays that degrade processor performance. We introduce a new storage media with a novel addressing mechanism to avoid address calculations. Instead of a memory address, each load and store is assigned a signature for accessing the new storage. A signature consists of the color of the base register along with its displacement value. A unique color is assigned to a register whenever the register is updated. When two memory instructions have the same signature, they address to the same memory location. This memory signature can be formed early in the processor pipeline. A small signature buffer, addressed by the memory signature, can be established to permit stores and loads bypassing normal memory hierarchy for fast data communication. Performance evaluations based on an Alpha 21264-like pipeline using SPEC2000 integer benchmarks show that an IPC (instruction-per-cycle) improvement of 13-18% is possible using a small 8-entry signature buffer.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128912776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Program counter based techniques for dynamic power management
C. Gniady, Y. C. Hu, Yung-Hsiang Lu
Reducing energy consumption has become one of the major challenges in designing future computing systems. We propose a novel idea of using program counters to predict I/O activities in the operating system. We present a complete design of program-counter access predictor (PCAP) that dynamically learns the access patterns of applications and predicts when an I/O device can be shut down to save energy. PCAP uses path-based correlation to observe a particular sequence of program counters leading to each idle period, and predicts future occurrences of that idle period. PCAP differs from previously proposed shutdown predictors in its ability to: (1) correlate I/O operations to particular behavior of the applications and users, (2) carry prediction information across multiple executions of the applications, and (3) attain better energy savings while incurring low mispredictions.
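A minimal sketch of the path-based predictor follows; the path depth, the break-even idle time, and the training hook are assumptions that make the correlation idea concrete.

```python
# Hedged sketch: correlate the sequence of I/O call-site PCs with the
# idle period that follows, and shut the device down when a previously
# seen path predicts an idle period longer than break-even.
from collections import deque

class PCAP:
    def __init__(self, depth=2, breakeven_ms=50):
        self.path = deque(maxlen=depth)  # recent I/O call-site PCs
        self.table = {}                  # path -> observed idle length (ms)
        self.breakeven_ms = breakeven_ms
        self.key = None

    def on_io(self, pc):
        """At each I/O request: should the device sleep afterwards?"""
        self.path.append(pc)
        self.key = tuple(self.path)
        return self.table.get(self.key, 0) > self.breakeven_ms

    def on_idle_end(self, idle_ms):
        """Train: credit the idle period to the path that preceded it."""
        if self.key is not None:
            self.table[self.key] = idle_ms

p = PCAP()
for run in range(2):        # the application repeats its I/O pattern
    p.on_io(0xA0)
    sleep = p.on_io(0xB4)   # path (0xA0, 0xB4) precedes a long idle
    p.on_idle_end(200)
    print(sleep)            # run 0: False (untrained); run 1: True
```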
{"title":"Program counter based techniques for dynamic power management","authors":"C. Gniady, Y. C. Hu, Yung-Hsiang Lu","doi":"10.1109/HPCA.2004.10021","DOIUrl":"https://doi.org/10.1109/HPCA.2004.10021","url":null,"abstract":"Reducing energy consumption has become one of the major challenges in designing future computing systems. We propose a novel idea of using program counters to predict I/O activities in the operating system. We present a complete design of program-counter access predictor (PCAP) that dynamically learns the access patterns of applications and predicts when an I/O device can be shut down to save energy. PCAP uses path-based correlation to observe a particular sequence of program counters leading to each idle period, and predicts future occurrences of that idle period. PCAP differs from previously proposed shutdown predictors in its ability to: (1) correlate I/O operations to particular behavior of the applications and users, (2) carry prediction information across multiple executions of the applications, and (3) attain better energy savings while incurring low mispredictions.","PeriodicalId":145009,"journal":{"name":"10th International Symposium on High Performance Computer Architecture (HPCA'04)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2004-02-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128549250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 53