
Latest publications from the 2006 IEEE International Symposium on Performance Analysis of Systems and Software

Characterizing the branch misprediction penalty
Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620789
Stijn Eyerman, James E. Smith, L. Eeckhout
Despite years of study, branch mispredictions remain a significant performance impediment in pipelined superscalar processors. In general, the branch misprediction penalty can be substantially larger than the frontend pipeline length (which is often equated with the misprediction penalty). We identify and quantify five contributors to the branch misprediction penalty: (i) the frontend pipeline length, (ii) the number of instructions since the last miss event (branch misprediction, I-cache miss, long D-cache miss), which is related to the burstiness of miss events, (iii) the inherent ILP of the program, (iv) the functional unit latencies, and (v) the number of short (L1) D-cache misses. The characterizations done in this paper are driven by 'interval analysis', an analytical approach that models superscalar processor performance as a sequence of inter-miss intervals.
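The interval model can be sketched at its crudest: total cycles are the ideal dispatch cycles plus a penalty charged at each miss event that terminates an interval. The event trace and penalty values below are invented for illustration; the paper's model is considerably more detailed.

```python
# Minimal interval-analysis sketch: cycles = ideal dispatch time plus a
# fixed penalty per miss event (all numbers here are hypothetical).

def interval_cycles(events, n_insns, dispatch_width, penalties):
    """Estimate total cycles as steady-state dispatch time plus a
    penalty for each miss event that ends an interval."""
    base = n_insns / dispatch_width              # ideal, miss-free cycles
    penalty = sum(penalties[kind] for _, kind in events)
    return base + penalty

# Hypothetical per-event penalties, in cycles.
penalties = {"branch_mispredict": 14, "icache_miss": 10, "dcache_long_miss": 200}
events = [(1200, "branch_mispredict"), (3400, "icache_miss"),
          (5100, "dcache_long_miss")]           # (instruction count, kind)
print(interval_cycles(events, 10_000, 4, penalties))  # 2500 + 224 = 2724.0
```

The paper's point is precisely that the per-event penalty is not a constant such as the frontend pipeline length; this sketch only shows the accounting framework into which the five contributors feed.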
Citations: 59
Performance modeling and prediction for scientific Java applications
Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620804
Rui Zhang, Zoran Budimlic, K. Kennedy
With the expansion of the Internet, the grid has become an attractive platform for scientific computing. Java, with a platform-independent execution model and built-in support for distributed computing, is an inviting choice for implementing applications intended for grid execution. Recent work has shown that an accurate performance model combined with a load-balancing scheduling strategy can significantly improve the performance of distributed applications on a heterogeneous computing platform such as the grid. However, current performance modeling techniques are not suitable for Java applications, as the virtual machine execution model presents several difficulties: 1) a significant amount of time is spent on compilation at the beginning of the execution, 2) the virtual machine continuously profiles and recompiles the code during the execution, 3) garbage collection can have unpredictable effects on the memory hierarchy, 4) some applications can spend more time garbage collecting than computing for certain heap sizes, and 5) small variations in virtual machine implementation can have a large impact on the application's behavior. In this paper, we present a practical profile-based strategy for performance modeling of Java scientific applications intended for execution on the grid. We introduce two novel concepts for the Java execution model: the point of predictability (PoP) and the point of unpredictability (PoU). PoP accounts for the volatile nature of the virtual machine's effects on execution time for small problem sizes. PoU accounts for the effects of garbage collection on applications whose memory footprint approaches the total heap size. We present an algorithm for determining PoP and PoU for Java applications, given the hardware platform, virtual machine, and heap size. We also present a code-instrumentation-based mechanism for building the algorithm complexity model for a given application. We introduce a technique for calibrating this model that can accurately predict the execution time of Java programs for problem sizes between PoP and PoU. Our preliminary experiments show that these techniques can achieve load balancing with more than 90% average CPU utilization.
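The calibration idea, fitting a complexity model only to profiled points that fall between PoP and PoU, can be sketched as a log-log least-squares fit. The power-law model form and the sample timings below are assumptions for illustration, not the paper's actual calibration procedure.

```python
import math

def calibrate(samples, pop, pou):
    """Fit T(n) = c * n**k by least squares in log-log space, using only
    profiled points with pop <= n <= pou; outside that window the VM
    (below PoP) or the garbage collector (above PoU) makes timings
    unpredictable, so those points are excluded from the fit."""
    pts = [(math.log(n), math.log(t)) for n, t in samples if pop <= n <= pou]
    m = len(pts)
    sx = sum(x for x, _ in pts); sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts); sxy = sum(x * y for x, y in pts)
    k = (m * sxy - sx * sy) / (m * sxx - sx * sx)   # fitted exponent
    c = math.exp((sy - k * sx) / m)                 # fitted constant
    return c, k

# Hypothetical O(n^2) kernel: perfectly quadratic timings inside [PoP, PoU].
samples = [(100, 0.5e-4 * 100**2), (200, 0.5e-4 * 200**2), (400, 0.5e-4 * 400**2)]
c, k = calibrate(samples, pop=100, pou=1000)
print(round(k, 3))  # 2.0
```

A scheduler could then use `c * n**k` to predict the runtime of a problem of size `n` on each grid node, provided `n` stays between that node's PoP and PoU.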
Citations: 3
Comparing simulation techniques for microarchitecture-aware floorplanning
Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620792
Vidyasagar Nookala, Ying Chen, D. Lilja, S. Sapatnekar
Due to the long simulation times of the reference input sets, microarchitects resort to alternative techniques to speed up cycle-accurate simulations. However, the reduction in runtimes comes with an associated loss of accuracy in replicating the characteristics of the reference sets. In addition, the effect of these inaccuracies on overall performance can vary across different microarchitecture optimizations or enhancements. In this work, we study and compare two such techniques, reduced input sets and statistical sampling, in the context of microarchitecture-aware floorplanning, a physical design stage whose objective is to find an IPC-optimal global placement of the blocks of a microprocessor. The variation in IPC results from the insertion of additional flip-flops on some cross-chip wires of the processor that have multicycle delays in nanometer technology nodes. The objective of IPC-aware floorplanning is to minimize the amount of pipelining required by the system buses that are critical in determining system performance. Our results indicate that, although the two techniques exhibit contrasting behavior in quantifying the criticality of bus latencies, the ensuing floorplanning optimization process results in almost identical performance improvements for both reduced input sets and sampling. The reason is that, for discrete optimization problems such as IPC-aware floorplanning, a reasonably accurate relative ordering of performance bottlenecks is sufficient; absolute accuracy is not necessary.
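The relative-versus-absolute-accuracy point can be made concrete with a toy comparison: two simulation techniques may report quite different IPC losses per bus, yet drive the floorplanner to the same answer if they rank the buses identically. The bus names and numbers below are invented.

```python
# Toy check that two criticality estimates agree on *ordering*, which is
# all a discrete optimizer like a floorplanner needs (data is invented).

def same_ordering(a, b):
    """True if two per-bus criticality estimates rank the buses identically."""
    rank = lambda d: sorted(d, key=d.get, reverse=True)
    return rank(a) == rank(b)

reduced_inputs = {"bus_fp": 0.31, "bus_ld": 0.55, "bus_wb": 0.14}  # IPC loss
sampling       = {"bus_fp": 0.25, "bus_ld": 0.48, "bus_wb": 0.10}
print(same_ordering(reduced_inputs, sampling))  # True
```

Despite disagreeing on every absolute number, both estimates would lead the floorplanner to pipeline `bus_wb` first and protect `bus_ld` last, matching the paper's observation that the two techniques yield almost identical floorplans.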
Citations: 2
Revisiting the performance impact of branch predictor latencies
Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620790
G. Loh
Branch predictors play a critical role in the performance of modern processors, and prediction accuracy is known to be the most important attribute of such predictors. However, the latency of the predictor can also have a profound impact on performance. Past studies that have considered branch prediction latency mostly consider only the latency required to make a prediction. However, in deeply pipelined processors, the latency between prediction and update can also greatly affect performance. In this study, we revisit the performance impact of both of these latencies and demonstrate that update latency can also have a significant impact on performance. We then describe two techniques, multi-overriding and hierarchical updates, to address both latencies; they provide 4.4% and 5.7% IPC improvements on moderately (20-stage) and deeply (40-stage) pipelined processors, respectively, with minimal hardware complexity.
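The update-latency effect can be illustrated with a toy bimodal predictor whose counter updates land a fixed number of branches late. The table size, 2-bit counter scheme, and trace are invented for illustration and are much simpler than the predictors the paper studies.

```python
# Toy bimodal predictor with delayed updates: predictions use counters
# whose training lags by `update_delay` branches (all parameters invented).

def simulate(trace, table_bits=10, update_delay=0):
    table = [1] * (1 << table_bits)          # 2-bit counters, weakly not-taken
    pending = []                             # FIFO of (ready_at, index, taken)
    correct = 0
    for n, (pc, taken) in enumerate(trace):
        # retire updates whose latency has elapsed
        while pending and pending[0][0] <= n:
            _, idx, t = pending.pop(0)
            table[idx] = min(3, table[idx] + 1) if t else max(0, table[idx] - 1)
        idx = pc & ((1 << table_bits) - 1)
        correct += (table[idx] >= 2) == taken
        pending.append((n + 1 + update_delay, idx, taken))
    return correct / len(trace)

all_taken = [(0x40, True)] * 400
print(simulate(all_taken), simulate(all_taken, update_delay=10))  # 0.9975 0.9725
```

Even on this trivially predictable trace, a 10-branch update delay stretches the warm-up from one misprediction to eleven; on real traces with phase changes, stale counters keep costing accuracy in steady state, which is the effect the paper's hierarchical updates target.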
Citations: 12
Friendly fire: understanding the effects of multiprocessor prefetches
Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620802
Natalie D. Enright Jerger, Eric L. Hill, Mikko H. Lipasti
Modern processors attempt to overcome increasing memory latencies by anticipating future references and prefetching those blocks from memory. The behavior and possible negative side effects of prefetching schemes are fairly well understood for uniprocessor systems. However, in a multiprocessor system a prefetch can steal read and/or write permissions for shared blocks from other processors, leading to permission thrashing and overall performance degradation. In this paper, we present a taxonomy that classifies the effects of multiprocessor prefetches. We also present a characterization of the effects of four different hardware prefetching schemes - sequential prefetching, content-directed data prefetching, wrong path prefetching and exclusive prefetching - in a bus-based multiprocessor system. We show that accuracy and coverage are inadequate metrics for describing prefetching in a multiprocessor; rather, we also need to understand what fraction of prefetches interferes with remote processors. We present an upper bound on the performance of various prefetching algorithms if no harmful prefetches are issued, and suggest prefetch filtering schemes that can accomplish this goal.
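The spirit of such a taxonomy can be sketched with an idealized classifier: a prefetch helps if the prefetching CPU touches the block first, hurts if another CPU touches it first (the prefetch stole that CPU's permission), and is wasted if nobody does. The category names and the single-rule classification below are a simplification, not the paper's exact taxonomy.

```python
# Idealized multiprocessor prefetch classifier (simplified illustration;
# the paper's taxonomy distinguishes more cases than these three).

def classify(prefetch_cpu, block, later_accesses):
    """Label a prefetch by who touches the prefetched block next."""
    for cpu, addr in later_accesses:
        if addr == block:
            return "useful" if cpu == prefetch_cpu else "harmful"
    return "useless"

accesses = [(1, 0xA0), (0, 0xB0)]        # (cpu, block) in global order
print(classify(0, 0xA0, accesses))       # harmful: CPU 1 needs it first
print(classify(0, 0xB0, accesses))       # useful: CPU 0 uses its own prefetch
print(classify(0, 0xC0, accesses))       # useless: nobody touches it
```

This is why the paper argues uniprocessor-style accuracy and coverage metrics are insufficient: the `harmful` category only exists once remote processors enter the picture.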
Citations: 23
MESA: reducing cache conflicts by integrating static and run-time methods
Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620803
Xiaoning Ding, Dimitrios S. Nikolopoulos, Song Jiang, Xiaodong Zhang
The paper proposes MESA (Multicoloring with Embedded Skewed Associativity), a novel cache indexing scheme that integrates dynamic page coloring with static skewed associativity to reduce conflicts in L2/L3 caches with a small degree of associativity. MESA associates multiple cache pages (colors) with each virtual memory page and uses two-level skewed associativity, first to map a page to a different color in each bank of the cache, and then to disperse the lines of a page across the banks and within the colors of the page. MESA is a multi-grained cache indexing scheme that combines the best of two worlds, page coloring and skewed associativity. We also propose a novel cache management scheme based on page remapping, which uses cache miss imbalance between colors in each bank as the metric to track conflicts and trigger remapping. We evaluate MESA using 24 benchmarks from multiple application domains and with various degrees of sensitivity to conflict misses, on both an in-order issue processor (using complete system simulation) and an out-of-order issue processor (using SimpleScalar). MESA outperforms skewed associativity, prime modulo hashing, and dynamic page coloring schemes proposed earlier. Compared to a 4-way associative cache, MESA can provide as much as 76% improvement in IPC.
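The two-level indexing idea (a per-bank page color chosen at page-mapping time, followed by a skewed line index within that color) might be sketched as follows. The field widths, XOR hash, and color map here are invented and differ from the paper's actual index functions.

```python
# Sketch of MESA-style two-level indexing (hypothetical field widths and
# hash; the real MESA functions are defined in the paper).

PAGE_BITS, LINE_BITS = 12, 6                       # 4 KB pages, 64 B lines
LINE_MASK = (1 << (PAGE_BITS - LINE_BITS)) - 1     # lines per page - 1

def bank_index(vaddr, color_map, bank):
    vpage = vaddr >> PAGE_BITS
    line = (vaddr >> LINE_BITS) & LINE_MASK
    # Level 1: the color assigned to this page in this bank (page coloring);
    # remapping a hot page means changing this entry.
    color = color_map[(vpage, bank)]
    # Level 2: skew the line within the color by XOR-ing in page and bank
    # bits, so the same line offset of different pages lands in different sets.
    skewed = (line ^ (vpage & LINE_MASK) ^ bank) & LINE_MASK
    return (color << (PAGE_BITS - LINE_BITS)) | skewed

cmap = {(0x11, 0): 3, (0x22, 0): 3}                # two pages, same color
a = (0x11 << PAGE_BITS) | 0x240
b = (0x22 << PAGE_BITS) | 0x240
print(bank_index(a, cmap, 0), bank_index(b, cmap, 0))  # 216 235
```

Even though both pages share color 3 and the accesses share the same line offset, the skew separates them into different sets, which is the conflict-reduction effect MESA layers on top of coloring.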
Citations: 12
Improved stride prefetching using extrinsic stream characteristics
Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620801
H. Al-Sukhni, James Holt, D. Connors
Stride-based prefetching mechanisms exploit regular streams of memory accesses to hide memory latency. While these mechanisms are effective, they can be improved by studying the properties of regular streams. As evidence of this, the establishment of metrics to quantify intrinsic characteristics of regular streams has been shown to enable software-based code optimizations. In this paper we extend previously identified regular stream metrics to quantify extrinsic characteristics of regular streams, and show how these new metrics can be employed to improve the efficiency of stride prefetching. The extrinsic metrics we introduce are stream affinity and stream density. Stream affinity enables prefetching for short streams that were previously ignored by stride prefetching mechanisms. Stream density enables a prioritization mechanism that dynamically selects amongst available streams in favor of those that promise more miss coverage, and provides thrashing control amongst several coexisting streams. Finally, we show that using intrinsic and extrinsic stream metrics in combination allows a novel hardware technique for controlling prefetch ahead distance (PAD) which dynamically adjusts the prefetch launch time to better enable timely prefetches while minimizing cache pollution. For a representative set of SPEC2K traces, our techniques consistently outperform our implementation of the closest previously reported stride-based prefetching technique.
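A baseline stride prefetcher with a tunable prefetch-ahead distance (PAD) can be sketched as below; the table layout, confidence threshold, and trace are illustrative, and the paper's contribution is dynamically adjusting `pad` (and prioritizing streams by density and affinity) rather than this fixed-parameter baseline.

```python
# Baseline stride prefetcher sketch with a fixed prefetch-ahead distance
# (hypothetical table and threshold; the paper tunes PAD dynamically).

def run(trace, pad=4):
    table = {}                     # pc -> (last_addr, stride, confidence)
    prefetches = []
    for pc, addr in trace:
        last, stride, conf = table.get(pc, (addr, 0, 0))
        new_stride = addr - last
        conf = conf + 1 if new_stride == stride and stride != 0 else 0
        if conf >= 2:              # stream confirmed: fetch PAD strides ahead
            prefetches.append(addr + stride * pad)
        table[pc] = (addr, new_stride, conf)
    return prefetches

trace = [(0x400, 0x1000 + 64 * i) for i in range(6)]   # one regular stream
print([hex(p) for p in run(trace, pad=4)])  # ['0x11c0', '0x1200', '0x1240']
```

Note what this baseline misses: the stream must repeat twice before anything is prefetched, so short streams get nothing (the gap the paper's stream-affinity metric addresses), and `pad` never adapts to memory latency (the gap its PAD-control hardware addresses).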
Citations: 6
Critical path analysis of the TRIPS architecture
Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620788
R. Nagarajan, Xia Chen, Robert G. McDonald, D. Burger, S. Keckler
Fast, accurate, and effective performance analysis is essential for the design of modern processor architectures and improving application performance. Recent trends toward highly concurrent processors make this goal increasingly difficult. Conventional techniques, based on simulators and performance monitors, are ill-equipped to analyze how a plethora of concurrent events interact and how they affect performance. Prior research has shown the utility of critical path analysis in solving this problem. This analysis abstracts the execution of a program with a dependence graph. With simple manipulations on the graph, designers can gain insights into the bottlenecks of a design. This paper extends critical path analysis to understand the performance of a next-generation, high-ILP architecture. The TRIPS architecture introduces new features not present in conventional superscalar architectures. We show how dependence constraints introduced by these features, specifically the execution model and operand communication links, can be modeled with a dependence graph. We describe a new algorithm that tracks critical path information at a fine-grained level and yet can deliver an order of magnitude (30x) improvement in performance over previously proposed techniques. Finally, we provide a breakdown of the critical path for a select set of benchmarks and show an example where we use this information to improve the performance of a heavily-hand-optimized program by as much as 11%.
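At its core, critical path analysis over a dependence graph is a longest-path computation on a DAG whose nodes are dynamic events with latencies and whose edges are dependence constraints. A toy version with invented node latencies (nothing TRIPS-specific, and far from the fine-grained tracking the paper proposes):

```python
# Toy dependence-graph critical path: longest path through a DAG of
# events (latencies and edges are invented example data).

def critical_path(latency, edges):
    """Longest path through a dependence DAG; `edges` must be listed in
    topological order for the single relaxation pass to be valid."""
    dist = {n: latency[n] for n in latency}    # best completion time per node
    pred = {n: None for n in latency}          # critical predecessor
    for u, v in edges:
        if dist[u] + latency[v] > dist[v]:
            dist[v] = dist[u] + latency[v]
            pred[v] = u
    end = max(dist, key=dist.get)
    path = []
    while end is not None:
        path.append(end)
        end = pred[end]
    return dist[path[0]], path[::-1]

lat = {"fetch": 1, "ld": 3, "add": 1, "br": 1}
edges = [("fetch", "ld"), ("fetch", "add"), ("ld", "add"), ("add", "br")]
print(critical_path(lat, edges))  # (6, ['fetch', 'ld', 'add', 'br'])
```

Designers read the returned path as the chain of events that bounds execution time; the paper's contribution is tracking this information cheaply at fine grain for TRIPS-specific constraints such as operand communication links, rather than building the full graph as this sketch does.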
Citations: 22
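The critical path analysis described in the abstract above treats an execution as a dependence DAG and looks for its longest path. A minimal sketch of that core operation is below; the node names, latencies, and edges are hypothetical toy data, whereas a real model (e.g. for TRIPS) would include per-instruction fetch/execute/commit nodes and machine-specific edges such as operand-network hops.

```python
# Minimal sketch of dependence-graph critical path analysis: find the
# longest (latency-weighted) path through a DAG of execution events.
def critical_path(edges, latency):
    """Longest path through a DAG.

    edges:   dict node -> list of successor nodes
    latency: dict node -> cost of the node itself
    Returns (total latency, one critical path as a list of nodes).
    """
    # Post-order DFS: successors are appended before their predecessors,
    # so `order` is already safe for the relaxation below.
    order, seen = [], set()
    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for s in edges.get(n, ()):
            visit(s)
        order.append(n)
    for n in list(edges):
        visit(n)

    # Longest-path relaxation: each node's distance is its own latency
    # plus the best distance among its successors.
    dist = {n: latency[n] for n in latency}
    succ_on_path = {}
    for n in order:
        best = 0
        for s in edges.get(n, ()):
            if dist[s] > best:
                best, succ_on_path[n] = dist[s], s
        dist[n] = latency[n] + best

    start = max(dist, key=dist.get)
    path, n = [start], start
    while n in succ_on_path:
        n = succ_on_path[n]
        path.append(n)
    return dist[start], path

# Toy four-instruction graph: i0 feeds i1 and i2; both feed i3.
lat = {"i0": 1, "i1": 3, "i2": 1, "i3": 1}
dag = {"i0": ["i1", "i2"], "i1": ["i3"], "i2": ["i3"], "i3": []}
length, path = critical_path(dag, lat)
# length == 5 along i0 -> i1 -> i3, the chain a designer would try to shorten
```

Shortening any node off this path (here, i2) leaves total latency unchanged, which is exactly the insight critical path breakdowns give a designer.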
ATTILA: a cycle-level execution-driven simulator for modern GPU architectures
Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620807
Victor Moya Del Barrio, Carlos González, Jordi Roca, Agustín Fernández, R. Espasa
This work presents a cycle-level execution-driven simulator for modern GPU architectures. We discuss the simulation model used for our GPU simulator, based on the concept of boxes and signals, and the relation between the timing simulator and the functional emulator. The simulation model we use helps to increase accuracy and reduce the number of errors in the timing simulator while allowing for easy extensibility of the simulated GPU architecture. We also introduce the OpenGL framework used to feed the simulator with traces from real applications (UT2004, Doom3) and a performance debugging tool (Signal Trace Visualizer). The ATTILA simulator supports the simulation of a whole range of GPU configurations and architectures, from the embedded segment to the high-end PC segment, supporting both the unified and non-unified shader architectural models.
Victor Moya Del Barrio, Carlos González, Jordi Roca, Agustín Fernández, R. Espasa, "ATTILA: a cycle-level execution-driven simulator for modern GPU architectures," 2006 IEEE International Symposium on Performance Analysis of Systems and Software, 2006, DOI: 10.1109/ISPASS.2006.1620807.
Citations: 110
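The "boxes and signals" model mentioned in the ATTILA abstract can be illustrated with a tiny cycle-level simulator: boxes expose a `clock()` method, and signals are one-cycle latches, so every box observes only values written in the previous cycle. The specific boxes below (a counter feeding an accumulator) are purely illustrative, not components of ATTILA itself.

```python
# Minimal sketch of a boxes-and-signals cycle-level simulator.
class Signal:
    """One-entry latch: reads return the value written last cycle."""
    def __init__(self, initial=0):
        self.current = initial   # visible this cycle
        self.next = initial      # written this cycle, visible next cycle

    def write(self, value):
        self.next = value

    def read(self):
        return self.current

    def latch(self):
        self.current = self.next

class Counter:
    """Box that emits 0, 1, 2, ... on its output signal."""
    def __init__(self, out):
        self.out, self.value = out, 0
    def clock(self):
        self.out.write(self.value)
        self.value += 1

class Accumulator:
    """Box that sums whatever arrives on its input signal."""
    def __init__(self, inp):
        self.inp, self.total = inp, 0
    def clock(self):
        self.total += self.inp.read()

wire = Signal()
boxes = [Counter(wire), Accumulator(wire)]
for cycle in range(4):
    for box in boxes:
        box.clock()   # all boxes compute from latched values
    wire.latch()      # then all signals advance together

# Over 4 cycles the accumulator sees 0, 0, 1, 2 (one cycle behind the
# counter), so boxes[1].total == 3
```

Because every inter-box value passes through a latch, the result is independent of the order in which boxes are clocked within a cycle, which is what makes this style easy to extend and debug.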
Modeling TCAM power for next generation network devices
Pub Date : 2006-03-19 DOI: 10.1109/ISPASS.2006.1620796
B. Agrawal, T. Sherwood
Applications in computer networks often require high throughput access to large data structures for lookup and classification. Many advanced algorithms exist to speed these search primitives on network processors, general purpose machines, and even custom ASICs. However, supporting these applications with standard memories requires very careful analysis of access patterns, and achieving worst-case performance can be quite difficult and complex. A simple solution is often possible if a ternary CAM (content-addressable memory) is used to perform a fully parallel search across the entire data set. Unfortunately, this parallelism means that large portions of the chip are switching during each cycle, causing large amounts of power to be consumed. While researchers have begun to explore new ways of managing the power consumption, quantifying design alternatives is difficult due to a lack of available models. In this paper, we examine the structure inside a modern TCAM and present a simple, yet accurate, power model. We present techniques to estimate the dynamic power consumption of a large TCAM. We validate the model using industrial TCAM datasheets and prior published works. We present an extensive analysis of the model by varying various architectural parameters.
B. Agrawal, T. Sherwood, "Modeling TCAM power for next generation network devices," 2006 IEEE International Symposium on Performance Analysis of Systems and Software, 2006, DOI: 10.1109/ISPASS.2006.1620796.
Citations: 117
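The TCAM abstract's central point, that fully parallel search makes power scale with the whole array rather than one row, can be captured in a back-of-the-envelope estimate. The energy-per-cell and search-rate numbers below are placeholders, not figures from the paper; a real model would separate match-line, search-line, and priority-encoder contributions.

```python
# Back-of-the-envelope sketch of TCAM dynamic power: on every search cycle
# all match lines and search lines switch, so dynamic power scales with
# (entries x entry width), unlike a RAM where only one row activates.
def tcam_dynamic_power_watts(entries, bits_per_entry,
                             energy_per_cell_j=1.0e-15,  # hypothetical J/cell/search
                             search_rate_hz=250e6):
    cells_switching = entries * bits_per_entry  # every cell participates
    return cells_switching * energy_per_cell_j * search_rate_hz

# A 128K-entry, 144-bit TCAM at 250M searches/s under these assumptions:
power = tcam_dynamic_power_watts(128 * 1024, 144)
# ~4.7 W from the array alone, which is why power-aware TCAM partitioning
# and selective-precharge techniques matter
```

Halving the number of entries searched per cycle (for example, by banking the TCAM and pre-selecting a bank) halves this estimate, which is the kind of design trade-off such a model lets architects quantify.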