Pub Date : 2006-03-19DOI: 10.1109/ISPASS.2006.1620789
Stijn Eyerman, James E. Smith, L. Eeckhout
Despite years of study, branch mispredictions remain as a significant performance impediment in pipelined superscalar processors. In general, the branch misprediction penalty can be substantially larger than the frontend pipeline length (which is often equated with the misprediction penalty). We identify and quantify five contributors to the branch misprediction penalty: (i) the frontend pipeline length, (ii) the number of instructions since the last miss event (branch misprediction, I-cache miss, long D-cache miss)-this is related to the burstiness of miss events, (iii) the inherent ILP of the program, (iv) the functional unit latencies, and (v) the number of short (LI) D-cache misses. The characterizations done in this paper are driven by 'interval analysis', an analytical approach that models superscalar processor performance as a sequence of inter-miss intervals.
尽管经过多年的研究,分支错误预测仍然是流水线超标量处理器中一个重要的性能障碍。通常,分支错误预测的损失可能比前端管道的长度要大得多(前端管道的长度通常等同于错误预测的损失)。我们确定并量化了分支错误预测惩罚的五个因素:(i)前端管道长度,(ii)自最后一次错过事件(分支错误预测,i -缓存错过,长d -缓存错过)以来的指令数量-这与错过事件的爆发有关,(iii)程序的固有ILP, (iv)功能单元延迟,以及(v)短(LI) d -缓存错过的数量。本文中所做的表征是由“区间分析”驱动的,这是一种将超标量处理器性能建模为间隔间隔序列的分析方法。
{"title":"Characterizing the branch misprediction penalty","authors":"Stijn Eyerman, James E. Smith, L. Eeckhout","doi":"10.1109/ISPASS.2006.1620789","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620789","url":null,"abstract":"Despite years of study, branch mispredictions remain as a significant performance impediment in pipelined superscalar processors. In general, the branch misprediction penalty can be substantially larger than the frontend pipeline length (which is often equated with the misprediction penalty). We identify and quantify five contributors to the branch misprediction penalty: (i) the frontend pipeline length, (ii) the number of instructions since the last miss event (branch misprediction, I-cache miss, long D-cache miss)-this is related to the burstiness of miss events, (iii) the inherent ILP of the program, (iv) the functional unit latencies, and (v) the number of short (LI) D-cache misses. The characterizations done in this paper are driven by 'interval analysis', an analytical approach that models superscalar processor performance as a sequence of inter-miss intervals.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116947002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2006-03-19DOI: 10.1109/ISPASS.2006.1620804
Rui Zhang, Zoran Budimlic, K. Kennedy
With the expansion of the Internet, the grid has become an attractive platform for scientific computing. Java, with a platform-independent execution model and built-in support for distributed computing is an inviting choice for implementation of applications intended for grid execution. Recent work has shown that an accurate performance model combined with a load-balancing scheduling strategy can significantly improve the performance of distributed applications on a heterogeneous computing platform, such as the grid. However, current performance modeling techniques are not suitable for Java applications, as the virtual machine execution model presents several difficulties: 1) a significant amount of time is spent on compilation at the beginning of the execution, 2) the virtual machine continuously profiles and recompiles the code during the execution, 3) garbage collection can have unpredictable effects on memory hierarchy, 4) some applications can spend more time garbage collecting than computing for certain heap sizes and 5) small variations in virtual machine implementation can have a large impact on the application's behavior. In this paper, we present a practical profile-based strategy for performance modeling of Java scientific applications intended for execution on the grid. We introduce two novel concepts for the Java execution model: point of predictability (PoP) and point of unpredictability (PoU). PoP accounts for the volatile nature of the effects of the virtual machine on execution time for small problem sizes. PoU accounts for the effects of garbage collection on certain applications that have a memory footprint that approaches the total heap size. We present an algorithm for determining PoP and PoU for Java applications, given the hardware platform, virtual machine and heap size. We also present a code-instrumentation-based mechanism for building the algorithm complexity model for a given application. We introduce a technique for calibrating this model that is able to accurately predict the execution time of Java programs for problem sizes between PoP and PoU. Our preliminary experiments show that techniques can achieve load balancing with more than 90% average CPU utilization.
{"title":"Performance modeling and prediction for scientific Java applications","authors":"Rui Zhang, Zoran Budimlic, K. Kennedy","doi":"10.1109/ISPASS.2006.1620804","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620804","url":null,"abstract":"With the expansion of the Internet, the grid has become an attractive platform for scientific computing. Java, with a platform-independent execution model and built-in support for distributed computing is an inviting choice for implementation of applications intended for grid execution. Recent work has shown that an accurate performance model combined with a load-balancing scheduling strategy can significantly improve the performance of distributed applications on a heterogeneous computing platform, such as the grid. However, current performance modeling techniques are not suitable for Java applications, as the virtual machine execution model presents several difficulties: 1) a significant amount of time is spent on compilation at the beginning of the execution, 2) the virtual machine continuously profiles and recompiles the code during the execution, 3) garbage collection can have unpredictable effects on memory hierarchy, 4) some applications can spend more time garbage collecting than computing for certain heap sizes and 5) small variations in virtual machine implementation can have a large impact on the application's behavior. In this paper, we present a practical profile-based strategy for performance modeling of Java scientific applications intended for execution on the grid. We introduce two novel concepts for the Java execution model: point of predictability (PoP) and point of unpredictability (PoU). PoP accounts for the volatile nature of the effects of the virtual machine on execution time for small problem sizes. PoU accounts for the effects of garbage collection on certain applications that have a memory footprint that approaches the total heap size. We present an algorithm for determining PoP and PoU for Java applications, given the hardware platform, virtual machine and heap size. We also present a code-instrumentation-based mechanism for building the algorithm complexity model for a given application. We introduce a technique for calibrating this model that is able to accurately predict the execution time of Java programs for problem sizes between PoP and PoU. Our preliminary experiments show that techniques can achieve load balancing with more than 90% average CPU utilization.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130391958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2006-03-19DOI: 10.1109/ISPASS.2006.1620792
Vidyasagar Nookala, Ying Chen, D. Lilja, S. Sapatnekar
Due to the long simulation times of the reference input sets, microarchitects resort to alternative techniques to speed up cycle-accurate simulations. However, the reduction in the runtimes comes with an associated loss of accuracy in replicating the characteristics of the reference sets. In addition, the effect of these inaccuracies on the overall performance can vary across different microarchitecture optimizations or enhancements. In this work, we study and compare two such techniques, reduced input sets and statistical sampling, in the context of microarchitecture-aware floorplanning, a physical design stage, where the objective is to find an IPC-optimal global placement of the blocks of a microprocessor. The variation in the IPC results due the insertion of additional flip-flops on some across-chip wires of the processor that have multicycle delays in nanometer technology nodes. The objective of IPC-aware floorplanning is to minimize the amount of pipelining required by the system buses that are critical in determining the system performance. Our results indicate that, although the two techniques exhibit contrasting behavior in quantifying the criticality of bus latencies, the ensuing floorplanning optimization process results in almost identical performance improvements for both reduced input sets and sampling. The reason behind this is that, for discrete optimization problems such as IPC-aware floorplanning, a reasonably accurate relative ordering of performance bottlenecks is sufficient, absolute accuracy is not necessary.
{"title":"Comparing simulation techniques for microarchitecture-aware floorplanning","authors":"Vidyasagar Nookala, Ying Chen, D. Lilja, S. Sapatnekar","doi":"10.1109/ISPASS.2006.1620792","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620792","url":null,"abstract":"Due to the long simulation times of the reference input sets, microarchitects resort to alternative techniques to speed up cycle-accurate simulations. However, the reduction in the runtimes comes with an associated loss of accuracy in replicating the characteristics of the reference sets. In addition, the effect of these inaccuracies on the overall performance can vary across different microarchitecture optimizations or enhancements. In this work, we study and compare two such techniques, reduced input sets and statistical sampling, in the context of microarchitecture-aware floorplanning, a physical design stage, where the objective is to find an IPC-optimal global placement of the blocks of a microprocessor. The variation in the IPC results due the insertion of additional flip-flops on some across-chip wires of the processor that have multicycle delays in nanometer technology nodes. The objective of IPC-aware floorplanning is to minimize the amount of pipelining required by the system buses that are critical in determining the system performance. Our results indicate that, although the two techniques exhibit contrasting behavior in quantifying the criticality of bus latencies, the ensuing floorplanning optimization process results in almost identical performance improvements for both reduced input sets and sampling. The reason behind this is that, for discrete optimization problems such as IPC-aware floorplanning, a reasonably accurate relative ordering of performance bottlenecks is sufficient, absolute accuracy is not necessary.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127210181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2006-03-19DOI: 10.1109/ISPASS.2006.1620790
G. Loh
Branch predictors play a critical role in the performance of modern processors, and the prediction accuracy is known to be the most important attribute of such predictors. However, the latency of the predictor can also have a profound impact on performance as well. In past studies that have considered branch prediction latency, most only consider the latency required to make a prediction. However, in deeply pipelined processors, the latency between prediction and update can also greatly affect performance. In this study, we revisit the performance impact of both of these latencies and demonstrate that update latency can also have a significant impact on performance. We then describe two techniques, multi-overriding and hierarchical updates, to address both latencies which provide 4.4% and 5.7% IPC improvements on moderately (20-stage) and deeply (40-stage) pipelined processors, respectively, for minimal hardware complexity.
{"title":"Revisiting the performance impact of branch predictor latencies","authors":"G. Loh","doi":"10.1109/ISPASS.2006.1620790","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620790","url":null,"abstract":"Branch predictors play a critical role in the performance of modern processors, and the prediction accuracy is known to be the most important attribute of such predictors. However, the latency of the predictor can also have a profound impact on performance as well. In past studies that have considered branch prediction latency, most only consider the latency required to make a prediction. However, in deeply pipelined processors, the latency between prediction and update can also greatly affect performance. In this study, we revisit the performance impact of both of these latencies and demonstrate that update latency can also have a significant impact on performance. We then describe two techniques, multi-overriding and hierarchical updates, to address both latencies which provide 4.4% and 5.7% IPC improvements on moderately (20-stage) and deeply (40-stage) pipelined processors, respectively, for minimal hardware complexity.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"227 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122692959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2006-03-19DOI: 10.1109/ISPASS.2006.1620802
Natalie D. Enright Jerger, Eric L. Hill, Mikko H. Lipasti
Modern processors attempt to overcome increasing memory latencies by anticipating future references and prefetching those blocks from memory. The behavior and possible negative side effects of prefetching schemes are fairly well understood for uniprocessor systems. However, in a multiprocessor system a prefetch can steal read and/or write permissions for shared blocks from other processors, leading to permission thrashing and overall performance degradation. In this paper, we present a taxonomy that classifies the effects of multiprocessor prefetches. We also present a characterization of the effects of four different hardware prefetching schemes - sequential prefetching, content-directed data prefetching, wrong path prefetching and exclusive prefetching - in a bus-based multiprocessor system. We show that accuracy and coverage are inadequate metrics for describing prefetching in a multiprocessor; rather, we also need to understand what fraction of prefetches interferes with remote processors. We present an upper bound on the performance of various prefetching algorithms if no harmful prefetches are issued, and suggest prefetch filtering schemes that can accomplish this goal.
{"title":"Friendly fire: understanding the effects of multiprocessor prefetches","authors":"Natalie D. Enright Jerger, Eric L. Hill, Mikko H. Lipasti","doi":"10.1109/ISPASS.2006.1620802","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620802","url":null,"abstract":"Modern processors attempt to overcome increasing memory latencies by anticipating future references and prefetching those blocks from memory. The behavior and possible negative side effects of prefetching schemes are fairly well understood for uniprocessor systems. However, in a multiprocessor system a prefetch can steal read and/or write permissions for shared blocks from other processors, leading to permission thrashing and overall performance degradation. In this paper, we present a taxonomy that classifies the effects of multiprocessor prefetches. We also present a characterization of the effects of four different hardware prefetching schemes - sequential prefetching, content-directed data prefetching, wrong path prefetching and exclusive prefetching - in a bus-based multiprocessor system. We show that accuracy and coverage are inadequate metrics for describing prefetching in a multiprocessor; rather, we also need to understand what fraction of prefetches interferes with remote processors. We present an upper bound on the performance of various prefetching algorithms if no harmful prefetches are issued, and suggest prefetch filtering schemes that can accomplish this goal.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"56 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120839371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2006-03-19DOI: 10.1109/ISPASS.2006.1620803
Xiaoning Ding, Dimitrios S. Nikolopoulos, Song Jiang, Xiaodong Zhang
The paper proposes MESA (Multicoloring with Embedded Skewed Associativity), a novel cache indexing scheme that integrates dynamic page coloring with static skewed associativity to reduce conflicts in L2/L3 caches with a small degree of associativity. MESA associates multiple cache pages (colors) with each virtual memory page and uses two-level skewed associativity, first to map a page to a different color in each bank of the cache, and then to disperse the lines of a page across the banks and within the colors of the page. MESA is a multi-grained cache indexing scheme that combines the best of two worlds, page coloring and skewed associativity. We also propose a novel cache management scheme based on page remapping, which uses cache miss imbalance between colors in each bank as the metric to track conflicts and trigger remapping. We evaluate MESA using 24 benchmarks from multiple application domains and with various degrees of sensitivity to conflict misses, on both an in-order issue processor (using complete system simulation) and an out-of-order issue processor (using SimpleScalar). MESA outperforms skewed associativity, prime modulo hashing, and dynamic page coloring schemes proposed earlier. Compared to a 4-way associative cache, MESA can provide as much as 76% improvement in IPC.
MESA (Multicoloring with Embedded biased Associativity)是一种新的缓存索引方案,它将动态页面着色与静态倾斜关联性相结合,以减少小关联度的L2/L3缓存中的冲突。MESA将多个缓存页面(颜色)与每个虚拟内存页面相关联,并使用两级倾斜的关联性,首先将页面映射到缓存的每个银行中的不同颜色,然后将页面的线条分散到各个银行和页面的颜色中。MESA是一种多粒度缓存索引方案,它结合了两个世界的优点:页面着色和倾斜关联。我们还提出了一种新的基于页面重映射的缓存管理方案,该方案使用每个银行中颜色之间的缓存丢失不平衡作为度量来跟踪冲突并触发重映射。我们在有序问题处理器(使用完整的系统模拟)和无序问题处理器(使用SimpleScalar)上使用来自多个应用领域的24个基准测试来评估MESA,并对冲突缺失具有不同程度的敏感性。MESA优于之前提出的歪斜结合性、素模哈希和动态页面着色方案。与4路关联缓存相比,MESA可以提供多达76%的IPC改进。
{"title":"MESA: reducing cache conflicts by integrating static and run-time methods","authors":"Xiaoning Ding, Dimitrios S. Nikolopoulos, Song Jiang, Xiaodong Zhang","doi":"10.1109/ISPASS.2006.1620803","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620803","url":null,"abstract":"The paper proposes MESA (Multicoloring with Embedded Skewed Associativity), a novel cache indexing scheme that integrates dynamic page coloring with static skewed associativity to reduce conflicts in L2/L3 caches with a small degree of associativity. MESA associates multiple cache pages (colors) with each virtual memory page and uses two-level skewed associativity, first to map a page to a different color in each bank of the cache, and then to disperse the lines of a page across the banks and within the colors of the page. MESA is a multi-grained cache indexing scheme that combines the best of two worlds, page coloring and skewed associativity. We also propose a novel cache management scheme based on page remapping, which uses cache miss imbalance between colors in each bank as the metric to track conflicts and trigger remapping. We evaluate MESA using 24 benchmarks from multiple application domains and with various degrees of sensitivity to conflict misses, on both an in-order issue processor (using complete system simulation) and an out-of-order issue processor (using SimpleScalar). MESA outperforms skewed associativity, prime modulo hashing, and dynamic page coloring schemes proposed earlier. Compared to a 4-way associative cache, MESA can provide as much as 76% improvement in IPC.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"221 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131611406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2006-03-19DOI: 10.1109/ISPASS.2006.1620801
H. Al-Sukhni, James Holt, D. Connors
Stride-based prefetching mechanisms exploit regular streams of memory accesses to hide memory latency. While these mechanisms are effective, they can be improved by studying the properties of regular streams. As evidence of this, the establishment of metrics to quantify intrinsic characteristics of regular streams has been shown to enable software-based code optimizations. In this paper we extend previously identified regular stream metrics to quantify extrinsic characteristics of regular streams, and show how these new metrics can be employed to improve the efficiency of stride prefetching. The extrinsic metrics we introduce are stream affinity and stream density. Stream affinity enables prefetching for short streams that were previously ignored by stride prefetching mechanisms. Stream density enables a prioritization mechanism that dynamically selects amongst available streams in favor of those that promise more miss coverage, and provides thrashing control amongst several coexisting streams. Finally, we show that using intrinsic and extrinsic stream metrics in combination allows a novel hardware technique for controlling prefetch ahead distance (PAD) which dynamically adjusts the prefetch launch time to better enable timely prefetches while minimizing cache pollution. For a representative set of SPEC2K traces, our techniques consistently outperform our implementation of the closest previously reported stride-based prefetching technique.
{"title":"Improved stride prefetching using extrinsic stream characteristics","authors":"H. Al-Sukhni, James Holt, D. Connors","doi":"10.1109/ISPASS.2006.1620801","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620801","url":null,"abstract":"Stride-based prefetching mechanisms exploit regular streams of memory accesses to hide memory latency. While these mechanisms are effective, they can be improved by studying the properties of regular streams. As evidence of this, the establishment of metrics to quantify intrinsic characteristics of regular streams has been shown to enable software-based code optimizations. In this paper we extend previously identified regular stream metrics to quantify extrinsic characteristics of regular streams, and show how these new metrics can be employed to improve the efficiency of stride prefetching. The extrinsic metrics we introduce are stream affinity and stream density. Stream affinity enables prefetching for short streams that were previously ignored by stride prefetching mechanisms. Stream density enables a prioritization mechanism that dynamically selects amongst available streams in favor of those that promise more miss coverage, and provides thrashing control amongst several coexisting streams. Finally, we show that using intrinsic and extrinsic stream metrics in combination allows a novel hardware technique for controlling prefetch ahead distance (PAD) which dynamically adjusts the prefetch launch time to better enable timely prefetches while minimizing cache pollution. For a representative set of SPEC2K traces, our techniques consistently outperform our implementation of the closest previously reported stride-based prefetching technique.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125863726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2006-03-19DOI: 10.1109/ISPASS.2006.1620788
R. Nagarajan, Xia Chen, Robert G. McDonald, D. Burger, S. Keckler
Fast, accurate, and effective performance analysis is essential for the design of modern processor architectures and improving application performance. Recent trends toward highly concurrent processors make this goal increasingly difficult. Conventional techniques, based on simulators and performance monitors, are ill-equipped to analyze how a plethora of concurrent events interact and how they affect performance. Prior research has shown the utility of critical path analysis in solving this problem. This analysis abstracts the execution of a program with a dependence graph. With simple manipulations on the graph, designers can gain insights into the bottlenecks of a design. This paper extends critical path analysis to understand the performance of a next-generation, high-ILP architecture. The TRIPS architecture introduces new features not present in conventional superscalar architectures. We show how dependence constraints introduced by these features, specifically the execution model and operand communication links, can be modeled with a dependence graph. We describe a new algorithm that tracks critical path information at a fine-grained level and yet can deliver an order of magnitude (30x) improvement in performance over previously proposed techniques. Finally, we provide a breakdown of the critical path for a select set of benchmarks and show an example where we use this information to improve the performance of a heavily-hand-optimized program by as much as 11%.
{"title":"Critical path analysis of the TRIPS architecture","authors":"R. Nagarajan, Xia Chen, Robert G. McDonald, D. Burger, S. Keckler","doi":"10.1109/ISPASS.2006.1620788","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620788","url":null,"abstract":"Fast, accurate, and effective performance analysis is essential for the design of modern processor architectures and improving application performance. Recent trends toward highly concurrent processors make this goal increasingly difficult. Conventional techniques, based on simulators and performance monitors, are ill-equipped to analyze how a plethora of concurrent events interact and how they affect performance. Prior research has shown the utility of critical path analysis in solving this problem. This analysis abstracts the execution of a program with a dependence graph. With simple manipulations on the graph, designers can gain insights into the bottlenecks of a design. This paper extends critical path analysis to understand the performance of a next-generation, high-ILP architecture. The TRIPS architecture introduces new features not present in conventional superscalar architectures. We show how dependence constraints introduced by these features, specifically the execution model and operand communication links, can be modeled with a dependence graph. We describe a new algorithm that tracks critical path information at a fine-grained level and yet can deliver an order of magnitude (30x) improvement in performance over previously proposed techniques. Finally, we provide a breakdown of the critical path for a select set of benchmarks and show an example where we use this information to improve the performance of a heavily-hand-optimized program by as much as 11%.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126023281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2006-03-19DOI: 10.1109/ISPASS.2006.1620807
Victor Moya Del Barrio, Carlos González, Jordi Roca, Agustín Fernández, R. Espasa
The present work presents a cycle-level execution-driven simulator for modern GPU architectures. We discuss the simulation model used for our GPU simulator, based in the concept of boxes and signals, and the relation between the timing simulator and the functional emulator. The simulation model we use helps to increase the accuracy and reduce the number of errors in the timing simulator while allowing for an easy extensibility of the simulated GPU architecture. We also introduce the OpenGL framework used to feed the simulator with traces from real applications (UT2004, Doom3) and a performance debugging tool (Signal Trace Visualizer). The presented ATTILA simulator supports the simulation of a whole range of GPU configurations and architectures, from the embedded segment to the high end PC segment, supporting both the unified and non unified shader architectural models.
{"title":"ATTILA: a cycle-level execution-driven simulator for modern GPU architectures","authors":"Victor Moya Del Barrio, Carlos González, Jordi Roca, Agustín Fernández, R. Espasa","doi":"10.1109/ISPASS.2006.1620807","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620807","url":null,"abstract":"The present work presents a cycle-level execution-driven simulator for modern GPU architectures. We discuss the simulation model used for our GPU simulator, based in the concept of boxes and signals, and the relation between the timing simulator and the functional emulator. The simulation model we use helps to increase the accuracy and reduce the number of errors in the timing simulator while allowing for an easy extensibility of the simulated GPU architecture. We also introduce the OpenGL framework used to feed the simulator with traces from real applications (UT2004, Doom3) and a performance debugging tool (Signal Trace Visualizer). The presented ATTILA simulator supports the simulation of a whole range of GPU configurations and architectures, from the embedded segment to the high end PC segment, supporting both the unified and non unified shader architectural models.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128372542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2006-03-19DOI: 10.1109/ISPASS.2006.1620796
B. Agrawal, T. Sherwood
Applications in computer networks often require high throughput access to large data structures for lookup and classification. Many advanced algorithms exist to speed these search primitives on network processors, general purpose machines, and even custom ASICs. However, supporting these applications with standard memories requires very careful analysis of access patterns, and achieving worst case performance can be quite difficult and complex. A simple solution is often possible if a Ternary CAM (content addressable memory) is used to perform a fully parallel search across the entire data set. Unfortunately, this parallelism means that large portions of the chip are switching during each cycle, causing large amounts of power to be consumed. While researchers have begun to explore new ways of managing the power consumption, quantifying design alternatives is difficult due to a lack of available models. In this paper, we examine the structure inside a modern TCAM and present a simple, yet accurate, power model. We present techniques to estimate the dynamic power consumption of a large TCAM. We validate the model using industrial TCAM datasheets and prior published works. We present an extensive analysis of the model by varying various architectural parameters. We also describe how new network algorithms have the potential to address the growing problem of power management in next-generation network devices.
{"title":"Modeling TCAM power for next generation network devices","authors":"B. Agrawal, T. Sherwood","doi":"10.1109/ISPASS.2006.1620796","DOIUrl":"https://doi.org/10.1109/ISPASS.2006.1620796","url":null,"abstract":"Applications in computer networks often require high throughput access to large data structures for lookup and classification. Many advanced algorithms exist to speed these search primitives on network processors, general purpose machines, and even custom ASICs. However, supporting these applications with standard memories requires very careful analysis of access patterns, and achieving worst case performance can be quite difficult and complex. A simple solution is often possible if a Ternary CAM (content addressable memory) is used to perform a fully parallel search across the entire data set. Unfortunately, this parallelism means that large portions of the chip are switching during each cycle, causing large amounts of power to be consumed. While researchers have begun to explore new ways of managing the power consumption, quantifying design alternatives is difficult due to a lack of available models. In this paper, we examine the structure inside a modern TCAM and present a simple, yet accurate, power model. We present techniques to estimate the dynamic power consumption of a large TCAM. We validate the model using industrial TCAM datasheets and prior published works. We present an extensive analysis of the model by varying various architectural parameters. We also describe how new network algorithms have the potential to address the growing problem of power management in next-generation network devices.","PeriodicalId":369192,"journal":{"name":"2006 IEEE International Symposium on Performance Analysis of Systems and Software","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125717758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}